NEURAL NETS FOR SCENE ANALYSIS

CHAPTER 1: INTRODUCTION

This  project  involved  various  new  optical  and  digital  neural  net  techniques  for  scene  analysis. 
The  original  neural  net  concept  was  the  adaptive  clustering  neural  net  (ACNN).  This  is  detailed  in 
Chapter  2.  Our  original  associative  processor  concept  was  the  Ho-Kashyap  neural  net.  This  is 
detailed  in  Chapter  3.  Our  overview  of  how  neural  nets  should  be  used  in  scene  analysis  is  detailed 
in  Chapter  4.  This  also  includes  an  overview  of  our  two  new  higher  order  neural  nets.  Our  new 
PQNN  neural  net  (which  produces  higher-order  decision  surfaces  much  more  efficiently  than  other 
neural  nets)  is  noted  in  Chapter  5.  To  achieve  high  performance  on  systems  with  components  with 
analog  accuracy  and  various  nonidealities,  we  developed  a  new  algorithm  and  technique  discussed 
in  Chapter  6.  We  have  fabricated  our  optical  laboratory  neural  net  and  tested  it  on  several  different 
case  studies  and  achieved  excellent  results  as  noted  in  Chapter  7. 
CHAPTER 2


Adaptive-clustering  optical  neural  net 


David  P.  Casasent  and  Etienne  Barnard 


Pattern  recognition  techniques  (for  clustering  and  linear  discriminant  function  selection)  are  combined  with 
neural  net  methods  (that  provide  an  automated  method  to  combine  linear  discriminant  functions  into 
piecewise  linear  discriminant  surfaces).  The  resulting  adaptive-clustering  neural  net  is  suitable  for  optical 
implementation  and  has  certain  desirable  properties  in  comparison  with  other  neural  nets.  Simulation 
results  are  provided. 


I. Introduction

Artificial neural networks have received much recent attention, and various optical
realizations of the classic backpropagation neural network have been suggested. Various
other optical neural network architectures have been described and some have been
demonstrated conceptually. In this paper we distinguish between optimization and adaptive
learning neural networks (Sec. II) and we discuss various neural net issues as background.
We then advance a new adaptive-clustering neural network (ACNN) in Sec. III. Simulation
results (performed on a Hecht-Nielsen Corporation electronic neural network) are then
presented (Sec. IV), optical realizations of the ACNN are discussed (Sec. V), and a summary
is advanced (Sec. VI). This ACNN uses a new learning algorithm that combines standard
pattern recognition techniques and neural net concepts to arrive at a new and quite useful
method for neural network synthesis that can be achieved optically with attractive results
and potential.

II. Artificial Neural Networks

We distinguish between two main classes of neural networks: optimization neural nets and
adaptive learning neural nets. Optimization neural nets are well understood and their basic
theory is well established. Associative processors are another class of neural networks that
are also well understood. In this paper we consider adaptive learning neural nets. The major
advantage of a neural net in multiclass pattern recognition is its ability to compute
nonlinear decision surfaces (typically combinations of linear decision surfaces) for complex
multiclass decision problems. In fact, many neural net classifiers can create decision
boundaries of arbitrary shape. Our proposed neural net uses this feature of neural nets in
conjunction with initial weights selected using class prototypes of clusters; hence we refer
to this as an adaptive-clustering neural net. It employs a three-layered architecture,
consisting of input, hidden, and output layers with interconnections between the input and
hidden layers, and between the hidden and output layers.

A. Neuron Representation Spaces and Dimensionality

To maintain a reasonable number of input (P1) neurons, we recommend that the neuron
representation space be an appropriate feature space. For image recognition applications,
the feature space should not be pixel-based. Other feature spaces have the additional
advantage that they can be made invariant to transformations such as in-plane rotations.
This greatly reduces the number of training images required (i.e., we need not train on
transformed versions of the objects to be identified). For an M-dimensional feature space,
we use M + 1 input neurons. The additional neuron is used to incorporate the threshold of
the hidden layer neurons into the input vector, with the state of this neuron set to unity.
We now detail this. A linear discriminant function (LDF) in a feature space described by
feature vectors $\mathbf{x}$ can be written as

$$g(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + w_0, \qquad (1)$$

where $\mathbf{w}$ defines the orientation of the linear decision boundary and $w_0$ defines
its offset or location. When decisions depend on whether $g \gtrless 0$, then $-w_0$ is the
threshold for the vector inner product (VIP) $\mathbf{w}^{T}\mathbf{x}$. By adding an
additional 1 to the feature vector $\mathbf{x}$ to produce $\mathbf{y}$, we include $w_0$ in
$\mathbf{w}$ and we can now write Eq. (1) as

$$g(\mathbf{x}) = \mathbf{w}^{T}\mathbf{y}. \qquad (2)$$
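The equivalence of Eqs. (1) and (2) can be checked numerically. The following is a minimal sketch only; the function names and the numeric values are hypothetical and chosen purely for illustration.

```python
def ldf(w, w0, x):
    """Eq. (1): g(x) = w^T x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def ldf_augmented(w_aug, x):
    """Eq. (2): g(x) = w_aug^T y, with y = [x, 1] and w_aug = [w, w0]."""
    y = list(x) + [1.0]  # extra input neuron held at unity
    return sum(wi * yi for wi, yi in zip(w_aug, y))


# Hypothetical two-feature example (values are illustrative assumptions).
w, w0 = [2.0, -1.0], 0.5
x = [0.25, 0.75]
g_plain = ldf(w, w0, x)
g_aug = ldf_augmented(w + [w0], x)
print(g_plain, g_aug)
```

Both calls return the same value, which is why the threshold can be carried by one extra input neuron instead of being stored separately in each hidden neuron.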


10  June  1990  /  Vol.  29.  No  17  /  APPLIED  OPTICS 




The number of neurons in layer two (hidden layer) is generally chosen empirically. The
number of hidden layer neurons determines the complexity of the decision surface. Thus, too
few neurons lead to poor classification performance, since a decision surface of complexity
sufficient to separate the various classes cannot be created. In most neural nets, the use
of too many hidden neurons is wasteful of resources and leads to poor generalization. By
this we mean that the decision surfaces are adapted to the peculiarities of the training
set.

Local minima are a frequent topic of discussion associated with the number of hidden
neurons used. A local minimum is a value of the energy function that is a minimum in a
local region, rather than being a global minimum. In training a backpropagation (BP) neural
net, the initial state of the hidden layer neurons is random and a given error rate and some
energy are obtained. When training is repeated with different initial hidden neuron states,
if a different error rate results, a local minimum exists. One must vary the number of
hidden neurons and retrain with different initial conditions to empirically determine the
number of hidden neurons. The presence of such variables results in long training times for
neural nets (as various numbers of layer-two neurons and various starting conditions are
tried) and it can result in a neural net that cannot easily be generalized to test data.

Local minima occur when hidden neurons become redundant during training (e.g., two of the N
hidden neurons encode decision boundaries that lie very close to one another). If each
neuron encoded a distinct decision boundary, a lower error rate would result (if the number
of neurons were too few). When the number of distinct hidden neurons is sufficient (equal to
or greater than the minimum required), there is no effect on classification performance,
since sufficiently complex decision surfaces can be created despite redundancies in the
hidden neurons. Thus, in this case local minima are not of concern. Many researchers have
found that extensive methods to produce 100% classification on training data are not
merited, since test set performance often does not reflect such improved training set
results. Recent work on the choice of the number of hidden neurons has concentrated on the
case when the training samples are in random positions in the feature space, which is
almost never the case in real pattern recognition problems.

Thus, although local minima are not of major concern, an alternate technique to determine
the number of hidden neurons with significantly reduced effort is a significant concern.
Our new neural net addresses this issue by an organized procedure that selects the number
of hidden neurons based on the number of clusters present in the multiclass data to be
separated (as detailed in Sec. III).

The number of neuron layers used is another variable. For BP, it has been shown that any
decision surface can be approximated to arbitrary accuracy with a three-layer neural net.
Four-layer neural nets can also produce any such decision surface, but they are harder to
train (since the Hessian of the criterion function with respect to the weights is more
ill-conditioned when more layers are used) and they generally introduce more parameters
that must be empirically selected. Since our neural net also approximates any such decision
boundary with three layers, we restrict attention to a three-layer neural net.

The number of output-layer neurons equals the number of classes.

B. Criterion or Error Functions

One of the most popular adaptive learning neural nets is backpropagation (BP). The problems
with this neural net are that it requires a large training set and long training time, and
does not necessarily converge to the best minimum. Backpropagation is an example of a
neural net which is trained by the minimization of an error or criterion function. The form
of the error function that is minimized for such nets can affect performance and training
time (e.g., the error function with the best error rate is often the one for which it is
most difficult to reach a minimum error). Standard BP uses an error function based on a
sigmoid transfer function, while our ACNN uses the perceptron error function in training.
We recently provided a comparison of various error or criterion functions. It was shown
that, in general, the use of a perceptron criterion function provides faster convergence
with error rates $P_e$ comparable to those obtained with the more popular sigmoid criterion
function. The error function choice is not of major concern in the performance of BP and
our ACNN (it is included to note the differences between BP and ACNN and because the
criterion function used specifies the type of linear classifier employed, as we detail in
Sec. III).

C.  Update  Algorithm 

One reason for the slow convergence of BP is that a gradient-descent (delta rule) algorithm
is often used to update the weights in training. Our ACNN uses a conjugate-gradient
algorithm for weight update since it is faster and does not require the empirical choice of
parameters such as the learning rate and momentum. In conjugate-gradient updating, all of
the training set data are fed to the system (once) and then the weights are updated.
Conversely, with gradient descent the weights can be updated after the presentation of each
sample in the training set. A batch type of gradient-descent algorithm can also be used,
with weights updated only after all training data have been presented to the system once.
Generally, batch gradient descent has the slowest convergence (since the parameters cannot
be updated and selected at different steps). Sequential (nonbatch) gradient descent
generally performs better than batch gradient descent, since it makes more steps toward the
solution (in one presentation of the training set of data). However, selection of its
parameters is empirical and we have found that conjugate-gradient optimization performs
better. We attribute this to the fact that conjugate-gradient optimization adapts the
learning parameters in a sensible way, whereas these parameters are kept fixed or adapted
heuristically for gradient descent.

In difficult multiclass decision problems we have found conjugate-gradient training to be
much more efficient than gradient descent. With neural net hardware and software (such as
the Hecht-Nielsen Corp. AZP which we use) conjugate-gradient optimization is very
attractive. In our comparisons of BP and the ACNN we use the same conjugate-gradient
algorithm to update the weights.

D.  Initial  Weights 

Another reason for the long training time for BP is that the initial weights are chosen
arbitrarily. In our ACNN algorithm, the initial weights are set using pattern recognition
techniques and then they are refined using neural network techniques. This is a major
reason for the improved performance of our ACNN. We have tested BP using initial weights
chosen from clustering techniques similar to those used for the initial weights of the
ACNN. We found negligible improvement in training time and worse performance in some cases.
We attribute this to the fact that BP can sometimes use hidden neurons in more
sophisticated ways than is the case in the hidden layer of our ACNN and that this cannot be
achieved when a preset weight choice is used.

This present section was intended to highlight issues associated with neural networks and
to note differences between our algorithm and the more extensively tested and analyzed BP
algorithm.

III. Adaptive Clustering Neural Net (ACNN) Training Algorithm

Our three-layer ACNN is shown in Fig. 1. It is similar to the standard multilayer
perceptron. We now detail its design and use for multiclass pattern recognition. The input
(P1) neurons are analog and represent a feature space which can be of low dimensionality
(we add an additional feature which is always kept at unity to adapt the threshold of the
hidden neurons as well). The hidden layer neurons at P2 correspond to clusters in feature
space, with several clusters (neurons) used for each class in a multiclass application. The
P1-P2 weights are used to assign an input to a cluster. We typically use two to five
clusters per class. The layer-two neurons are binary and (in testing) the P2 neuron with
the largest input activity fires and denotes the cluster to which the input belongs. During
training the P1-P2 weights adapt as we will detail (we employ a conjugate-gradient
algorithm) and thus refine our initial weight estimates. The hidden layer-to-output weights
are fixed (all are either zero or one) and perform the mapping of the P2 clusters to one of
the classes (with one P3 neuron assigned per class of data). Thus, we initially assign
several layer-two cluster neurons to each class and use fixed P2-P3 weights to assign each
P2 cluster to a final class (output neuron in P3). This is attractive and new since it
allows us to use standard clustering and pattern recognition techniques to select the
initial P1-P2 weights (initial LDFs) and new neural net techniques to adapt or refine these
weights. We employ a perceptron criterion or error function (this defines our LDFs) rather
than a sigmoid error function, since faster convergence with a comparable error rate is
obtained.

Fig. 1. Adaptive-clustering neural net.
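The testing-mode behavior described above (winner-take-all at P2, fixed zero/one P2-P3 weights) can be sketched as follows. This is an illustrative sketch only: the function name, the weight values, and the cluster-to-class assignment are assumptions, not values from the paper.

```python
def acnn_classify(x, w12, w23):
    """Return the index of the P3 (class) neuron that fires for input x.

    w12[i] is the (D + 1)-element P1-P2 weight vector of cluster neuron i.
    w23[k][i] is 1 if cluster neuron i is assigned to class k, else 0.
    """
    y = list(x) + [1.0]  # append the input neuron that is always at unity
    acts = [sum(w * v for w, v in zip(w_i, y)) for w_i in w12]
    winner = max(range(len(acts)), key=lambda i: acts[i])
    hidden = [1 if i == winner else 0 for i in range(len(acts))]  # binary P2
    outputs = [sum(row[i] * hidden[i] for i in range(len(hidden))) for row in w23]
    return outputs.index(1)


# Hypothetical net: three cluster neurons; clusters 0 and 1 map to class 0,
# cluster 2 maps to class 1.
w12 = [[0.1, 0.1, -0.01], [0.3, 0.2, -0.065], [0.9, 0.8, -0.725]]
w23 = [[1, 1, 0], [0, 0, 1]]
class_a = acnn_classify([0.85, 0.9], w12, w23)
class_b = acnn_classify([0.2, 0.2], w12, w23)
print(class_a, class_b)
```

Because exactly one P2 neuron fires and each cluster belongs to exactly one class, exactly one P3 output equals 1.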

There are no commonly used standard (non-neural net) techniques to obtain piecewise linear
decision surfaces for two- or multiclass problems (except nearest-neighbor methods).
Because of the importance of neural net techniques in addressing this problem, and since we
use nearest-neighbor techniques in selecting our clusters, we briefly review standard
multiclass techniques. In a nearest-neighbor classifier, the distance between an input and
all training samples is calculated and the input is assigned to the class of the closest
training sample. From tests on all training data in each class, the bounds on each class
are determined and one can obtain piecewise linear decision surfaces. However, the
nearest-neighbor technique is computationally intensive (requiring calculation of the
distance to all training samples). Conversely, neural nets have a long training time (which
is off-line and of less concern) but their classification times (an on-line requirement)
are short. In addition, all training samples must be stored for a nearest-neighbor system
and thus storage requirements can be excessive. Finally, nearest-neighbor systems do not
perform well when the probability-density functions of the classes overlap significantly.
The calculation of the K nearest neighbors is useful here (the input is assigned to the
class to which the majority of these K samples belong). However, the selection of K is
empirical.
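As a point of reference, the nearest-neighbor and K-nearest-neighbor rules reviewed above can be sketched in a few lines. The function names and the sample data below are hypothetical; squared Euclidean distance is assumed as the distance measure.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def knn_classify(x, samples, labels, k=1):
    """Assign x to the majority class among its k nearest training samples."""
    order = sorted(range(len(samples)), key=lambda i: sq_dist(x, samples[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-class training set (illustrative values only).
samples = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]]
labels = [0, 0, 0, 1, 1]
near_1nn = knn_classify([4.5, 5.0], samples, labels, k=1)
near_3nn = knn_classify([4.5, 5.0], samples, labels, k=3)
origin_3nn = knn_classify([0.2, 0.2], samples, labels, k=3)
print(near_1nn, near_3nn, origin_3nn)
```

Every classification requires a distance to every stored sample, which is the on-line cost and storage burden noted in the text.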

Two other multiclass techniques are Gaussian and linear classifiers. Gaussian classifiers
assume that the data in each class are normally distributed and for each class its mean and
variance are estimated. To classify an input vector, a posteriori probabilities are
calculated for each class with Bayes' rule, and the input is assigned to the class with the
highest probability. This technique (and all parametric methods) works only if the data
follow the assumed distribution and this is rarely the case. To produce multiclass decision
boundaries with LDFs, the mean vector of each class can be calculated and used as an LDF.
The VIP of the input with each mean, followed by thresholding, denotes the class estimate
for the input. Criterion functions (error functions) represent a preferable way to select
an LDF for each class. One can employ pairwise LDFs (for each LDF, some class i is compared
with another class j). These approaches are computationally intensive and not attractive
for problems with many classes and they may lead to decision surfaces that have undefined
regions (not corresponding to any class).

Thus, standard linear discriminant techniques for multivariate pattern recognition allow us
to determine suitable linear discriminants, but these are generally not powerful enough for
realistic pattern recognition applications that require nonlinear decision surfaces. In our
ACNN, neural net techniques provide refinements to the linear discriminant weight estimates
and automatically combine many linear decision boundaries into piecewise linear decision
boundaries. We now detail the design and update rules for our ACNN.

A.  Selection  of  the  Number  of  Hidden  Layer  (Cluster) 
Neurons 

To select the prototypes/exemplars or cluster representatives we use two steps. As our
prototypes we desire the N prototypes in the training set whose removal causes the most
error in a nearest-neighbor classification. We assume a large training set (Nt samples) for
our multiclass problem (so large that simple clustering techniques cannot produce a
suitable set of clusters). We first use standard techniques for sample-number reduction to
obtain a modest number of prototypes Nr. This reduced nearest-neighbor clustering technique
divides the Nt samples into two groups (A and B), where the samples in A classify all Nt
samples correctly using a nearest-neighbor technique. Initially, all samples are in group
B. The samples in A are used as the prototypes in a nearest-neighbor classifier. Each
sample in B is sequentially presented to the nearest-neighbor classifier. If it is
incorrectly classified, it is added to A. This procedure is repeated until the samples in
group A can correctly classify all Nt samples. (Typically around 5% to 30% of the training
samples are still present in Nr and this is still too large a number of P2 neurons.)
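The first (sample-number reduction) step can be sketched as below. This is a minimal illustration under stated assumptions: the function names and data are hypothetical, and a sample presented while group A is still empty is treated as misclassified so that it seeds group A.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def condense(samples, labels):
    """Return indices of group A: samples that classify all others correctly."""
    keep = []  # group A; every sample not listed here is still in group B
    changed = True
    while changed:  # repeat passes until a full pass adds nothing to A
        changed = False
        for i, (x, lab) in enumerate(zip(samples, labels)):
            if i in keep:
                continue
            if keep:
                nearest = min(keep, key=lambda j: sq_dist(x, samples[j]))
                correct = labels[nearest] == lab
            else:
                correct = False  # assumption: empty A cannot classify anything
            if not correct:
                keep.append(i)
                changed = True
    return keep


# Hypothetical two-cluster data (illustrative values only).
samples = [[0, 0], [0.1, 0], [0, 0.1], [1, 1], [1.1, 1], [1, 1.1]]
labels = [0, 0, 0, 1, 1, 1]
kept = condense(samples, labels)
print(kept)
```

For well-separated clusters such as these, one prototype per cluster survives; overlapping classes retain many more samples, which is why a second reduction step is needed.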

Thus, we employ a second step to further reduce the number of prototypes (clusters) to an
acceptable number N. To achieve this, we remove the first prototype, use the remaining
Nr - 1 samples in a nearest-neighbor classifier to classify the Nt original samples and
calculate the number of misclassifications. We then remove only the second prototype and
repeat the above procedure with the remaining Nr - 1 samples. This procedure continues
until the removal (separately) of each of the Nr prototypes has been tested. If N is
prespecified, we keep the N prototypes whose removal would cause the most errors. We can
also use the number of errors obtained by removing each prototype to select N (i.e., we
select N that results in no more than a given error rate or for which there is a jump in
the number of errors produced). We ensure that at least one prototype is chosen from each
class. Ensuring that we keep one prototype per class has not been a problem in our
benchmarks (i.e., if the prototypes are ordered by their error rate, we do not find a
number of consecutive prototypes in one class before one from another class occurs). In our
initial benchmarks, we have not found significant branch points or jumps in the error rates
of the ordered samples. There is also no restriction that the same number of prototypes be
selected from each class (the data will determine this). Considerable flexibility is
possible in how the N prototypes are selected since training will refine the initial
choices; therefore, this issue is not of major concern.

This procedure does not account for the fact that, when several samples are not included as
prototypes, performance will be worse than when only one of the samples is omitted.
However, the purpose of selecting prototypes (or cluster representatives) is only to
provide a reasonable or approximate initial selection (the neural net adaptations of these
initial choices address the global problem).
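The second (leave-one-out ranking) step can be sketched as follows. The function names and data are hypothetical, and the way one prototype per class is guaranteed (take the top-ranked prototype of each class first, then fill by rank) is an assumption; the text only states that at least one prototype per class is kept.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def removal_errors(protos, proto_labels, samples, labels):
    """For each prototype, count misclassifications when it alone is removed."""
    errors = []
    for r in range(len(protos)):
        rest = [j for j in range(len(protos)) if j != r]
        n_err = 0
        for x, lab in zip(samples, labels):
            nearest = min(rest, key=lambda j: sq_dist(x, protos[j]))
            if proto_labels[nearest] != lab:
                n_err += 1
        errors.append(n_err)
    return errors


def select_prototypes(errors, proto_labels, n_keep):
    """Keep the n_keep prototypes whose removal hurts most, one per class first."""
    order = sorted(range(len(errors)), key=lambda r: errors[r], reverse=True)
    chosen = [next(r for r in order if proto_labels[r] == c)
              for c in sorted(set(proto_labels))]
    for r in order:
        if len(chosen) >= n_keep:
            break
        if r not in chosen:
            chosen.append(r)
    return sorted(chosen)


# Hypothetical reduced set of Nr = 4 prototypes and Nt = 7 training samples.
protos = [[0, 0], [0.2, 0], [1, 1], [3, 3]]
proto_labels = [0, 0, 1, 1]
samples = [[0, 0], [0.2, 0], [0.1, 0.1], [1, 1], [1.1, 0.9], [3, 3], [3.1, 3.0]]
labels = [0, 0, 0, 1, 1, 1, 1]
errors = removal_errors(protos, proto_labels, samples, labels)
chosen = select_prototypes(errors, proto_labels, n_keep=2)
print(errors, chosen)
```

Each prototype is scored in isolation, so, as noted above, the combined effect of dropping several prototypes at once is not captured; training is expected to compensate.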

We note that use of a nearest-neighbor technique for training is acceptable, but it is not
suitable for classification (where on-line real time requirements exist). The combination
of our nearest-neighbor prototype selection and ACNN update algorithm will be shown to
require fewer iterations than BP. To quantify the significance of this, we now briefly
address the number of operations required to select prototypes and relate it to the number
of operations required in one BP iteration on all Nt training samples. For each sample, our
prototype selection algorithm must calculate the distance to all other points in the
training set. For all Nt samples, the calculation of the distances from all points to all
points (i.e., the number of distance calculations required for one pass through the Nt
training samples) is approximately $0.5N_t^{2}$ (we precalculate this once and use the 0.5
factor since the calculations are symmetric). In BP, all Nt samples are presented and after
each sample we must calculate the activities of all N neurons (N hidden neurons are assumed
and the calculation of the activities of the output neurons is ignored), i.e., $N_t N$
calculations are required. The calculation times for the operations in the two cases are
equivalent; each is a VIP of dimension equal to that of the feature space used (the
calculation times for each operation are exact for the case of layer-one and layer-two
neurons). If the additional number of BP iterations required is I, for our algorithm to be
computationally efficient, we require

$$0.5\,N_t^{2} < N_t N I. \qquad (3)$$

Since Nt >> N, our algorithm may not offer a significant advantage in training time (once N
is fixed in BP) unless I is very large.

In obtaining the result in Eq. (3), we assumed that all Nt samples were used in selecting
the N prototypes. We have found that we need only use approximately 5N randomly selected
samples from the full set of Nt in our prototype selection (N is the number of prototypes
or cluster neurons used at P2 and we have always found that two to five neurons per class
suffice). Thus, we employ our algorithm using 5N samples (not Nt). The inequality to be
satisfied is now

$$25N < 2N_t I. \qquad (4)$$

To further evaluate this, we assume $N_t \approx 100N$ (this is quite typical for
distortion-invariant problems to adequately represent all distortions). We then find

$$25 < 200I, \qquad (5)$$

which is independent of N. This inequality is always satisfied. As we shall see, BP has
always required at least on the order of I = 100 more iterations of the full training set
than has our ACNN algorithm. In this case

$$25 < 2 \times 10^{4}, \qquad (6)$$

and the computational time savings of our algorithm is quite significant.
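The operation counts behind Eqs. (3)-(6) can be reproduced with a short calculation. This is a sketch only; the function names are hypothetical and N = 10 is an assumed value (for example, five classes with two clusters each), not a figure from the paper.

```python
def prototype_cost(n_used):
    """Pairwise distance calculations among n_used samples (symmetric, so 0.5 n^2)."""
    return 0.5 * n_used ** 2


def bp_extra_cost(n_t, n_hidden, extra_iters):
    """VIP calculations for extra_iters additional BP passes over n_t samples."""
    return n_t * n_hidden * extra_iters


N = 10          # assumed number of hidden (cluster) neurons
N_t = 100 * N   # training set size, using Nt of about 100N
I = 100         # additional BP iterations, on the order quoted in the text

full_cost = prototype_cost(N_t)      # all Nt samples used, left side of Eq. (3)
subset_cost = prototype_cost(5 * N)  # only 5N random samples used
bp_cost = bp_extra_cost(N_t, N, I)   # right side of Eq. (3)
ratio = subset_cost / bp_cost        # equals 25 / (200 I), as in Eqs. (5)-(6)
print(full_cost, subset_cost, bp_cost, ratio)
```

With these assumed numbers, prototype selection on 5N samples costs roughly a thousandth of the extra BP work, which matches the ratio 25 to 2 x 10^4 in Eq. (6).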

Thus, to summarize, in the two steps of our prototype selection algorithm we use 5N random
samples from the full Nt set. We select the number of hidden neurons N to be two to five
times the number of classes (depending on the difficulty of the problem). Sec. IV details
these choices for two examples.

B. Initial P1-P2 Weights

We now address how we select the initial P1-P2 (input-to-hidden layer) weights. We denote
the weight between input neuron j and hidden neuron i by $w_{ij}$. We denote the vector
position of prototype i in our D-dimensional feature space by $\mathbf{p}_i$ (i.e., this is
the feature vector for prototype i) and element j of it by $p_{ij}$. We can now describe
the input weights from P1 to P2 as

$$w_{ij} = \begin{cases} p_{ij} & \text{for } j = 1, \ldots, D \\ -\dfrac{1}{2}\displaystyle\sum_{k=1}^{D} p_{ik}^{2} & \text{for } j = D + 1. \end{cases} \qquad (7)$$

The first D (out of D + 1) elements of each weight vector from P1 to layer-two neuron i are
thus the feature vector $\mathbf{p}_i$ associated with that prototype. The last (D + 1)
input neuron activity is always 1 and its weight to hidden layer neuron i is associated
with its LDF threshold. We choose these initial weights since they ensure that the
classifier initially implements a nearest-neighbor classifier based on the prototypes, as
we now detail.

Each hidden neuron i has connections from all D + 1 input neurons and thus has a weight
vector $\mathbf{w}_i$ associated with it. For an input $\mathbf{x}_a$, the input to neuron i
in layer two is

$$\mathbf{w}_i^{T}\mathbf{x}_a = \mathbf{p}_i^{T}\mathbf{x}_a - 0.5\,\mathbf{p}_i^{T}\mathbf{p}_i, \qquad (8)$$

where the first term is the contribution to the VIP from the first D weights and the last
term is the contribution due to the additional D + 1 input neuron. We rewrite Eq. (8) as


Fig. 2. Perceptron criterion function plotted against z = VIP (axis marks at -S and S): S
denotes the safety margin and the solid and dashed curves correspond to classes 1 and 2,
respectively.

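$$\begin{aligned}
\mathbf{w}_i^{T}\mathbf{x}_a &= (0.5)\,2\mathbf{p}_i^{T}\mathbf{x}_a - 0.5\,\mathbf{p}_i^{T}\mathbf{p}_i + 0.5\,\mathbf{x}_a^{T}\mathbf{x}_a - 0.5\,\mathbf{x}_a^{T}\mathbf{x}_a \\
&= 0.5\left[\mathbf{x}_a^{T}\mathbf{x}_a - \left(\mathbf{p}_i^{T}\mathbf{p}_i - 2\mathbf{p}_i^{T}\mathbf{x}_a + \mathbf{x}_a^{T}\mathbf{x}_a\right)\right] \\
&= 0.5\left(\|\mathbf{x}_a\|^{2} - \|\mathbf{p}_i - \mathbf{x}_a\|^{2}\right). \qquad (9)
\end{aligned}$$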

From Eq. (9) we see that the VIP is related to the Euclidean distance (denoted by
$\|\cdot\|$) between the input $\mathbf{x}_a$ and the prototype $\mathbf{p}_i$ associated
with hidden neuron i. The choice of weights in Eq. (7) achieves nearest-neighbor
classification since it ensures, from Eq. (9), that the hidden neuron closest to
$\mathbf{x}_a$ will have the largest input (since the second term in Eq. (9) is then
smallest) and will be most active.
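The nearest-neighbor property of the initial weights can be verified numerically. The sketch below builds the weights of Eq. (7) from a few prototypes and checks that the most active hidden neuron is the one whose prototype is closest to the input; the function names and numeric values are hypothetical.

```python
def initial_weights(prototype):
    """Eq. (7): first D weights are the prototype, the last is -0.5 * |p|^2."""
    return list(prototype) + [-0.5 * sum(p * p for p in prototype)]


def hidden_inputs(weights, x):
    """VIP of each hidden neuron's weight vector with the augmented input [x, 1]."""
    y = list(x) + [1.0]
    return [sum(w * v for w, v in zip(w_i, y)) for w_i in weights]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


# Hypothetical prototypes and input in a D = 2 feature space.
prototypes = [[0.1, 0.2], [0.8, 0.9], [0.5, 0.1]]
x_a = [0.6, 0.3]

weights = [initial_weights(p) for p in prototypes]
acts = hidden_inputs(weights, x_a)
winner = max(range(len(acts)), key=lambda i: acts[i])
nearest = min(range(len(prototypes)), key=lambda i: sq_dist(x_a, prototypes[i]))

# Right-hand side of Eq. (9) for each hidden neuron.
norm_sq = sum(v * v for v in x_a)
eq9 = [0.5 * (norm_sq - sq_dist(p, x_a)) for p in prototypes]
print(winner, nearest, acts, eq9)
```

Since the term 0.5|x_a|^2 is common to all hidden neurons, ranking by VIP is the same as ranking by negative squared distance.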

C.  Training  (Weight  Update)  Algorithm 

We now detail how we update the initial P1-P2 weights to achieve improved piecewise linear
decision surfaces. We input each of the full Nt set of training vectors $\mathbf{x}_a$. For
each $\mathbf{x}_a$, we calculate the most active hidden neuron $i(c)$ in the proper class
$c$ and the most active one $i(\bar{c})$ in any other class ($\bar{c}$). We denote the
weight vectors for these two layer-two neurons by $\mathbf{w}_{i(c)}$ and
$\mathbf{w}_{i(\bar{c})}$ and their VIPs with the input by
$\mathbf{w}_{i(c)}^{T}\mathbf{x}_a$ and $\mathbf{w}_{i(\bar{c})}^{T}\mathbf{x}_a$. The
perceptron error function (criterion function) $E_P$ used is shown in Fig. 2. The solid and
dashed curves correspond to the true-class and false-class cases (classes 1 and 2),
respectively. The offset S is a safety margin that forces training set vectors which are
classified correctly by a small amount (<S) to also contribute to the criterion function.
As discussed elsewhere, we chose S = 0.05 (all features were normalized between 0 and 1).
The use of S forces the classifier to try to classify all training samples correctly by at
least an amount S, improving test set performance (and thus generalization).

For each training sample in Nt, we add an error (penalty) to $E_P$. The error added is

$$E = \begin{cases} 0 & \text{if } \mathbf{w}_{i(c)}^{T}\mathbf{x}_a > \mathbf{w}_{i(\bar{c})}^{T}\mathbf{x}_a + S \\ S + \mathbf{w}_{i(\bar{c})}^{T}\mathbf{x}_a - \mathbf{w}_{i(c)}^{T}\mathbf{x}_a & \text{otherwise,} \end{cases} \qquad (10)$$

where the E = 0 case corresponds to the situation when the proper layer-two neuron is most
active (by an amount S above the most active false neuron) and where the other case
corresponds to the situation when the false class VIP is larger than the true class VIP, or
within S of it.
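The per-sample penalty and its gradient can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes the usual linear perceptron penalty S + (false VIP) - (true VIP) in the nonzero case, it uses the augmented (D + 1)-element input in place of x_a, and all names and values are hypothetical.

```python
def vip(w, y):
    """Vector inner product of a weight vector with an (augmented) input."""
    return sum(a * b for a, b in zip(w, y))


def sample_penalty(weights, cluster_class, x, true_class, margin=0.05):
    """Return (E, gradients) for one training sample under the assumed penalty."""
    y = list(x) + [1.0]  # augmented input, last element held at unity
    acts = [vip(w, y) for w in weights]
    true_ids = [i for i, c in enumerate(cluster_class) if c == true_class]
    false_ids = [i for i, c in enumerate(cluster_class) if c != true_class]
    i_true = max(true_ids, key=lambda i: acts[i])    # most active, proper class
    i_false = max(false_ids, key=lambda i: acts[i])  # most active, other classes
    grads = [[0.0] * len(y) for _ in weights]
    if acts[i_true] > acts[i_false] + margin:
        return 0.0, grads  # correct by more than the safety margin
    error = margin + acts[i_false] - acts[i_true]
    grads[i_true] = [-v for v in y]  # dE/dw for the true-class neuron
    grads[i_false] = list(y)         # dE/dw for the false-class neuron
    return error, grads


# Hypothetical two-cluster net with weights built from prototypes
# (0.2, 0.2) for class 0 and (0.8, 0.8) for class 1.
weights = [[0.2, 0.2, -0.04], [0.8, 0.8, -0.64]]
cluster_class = [0, 1]
e_safe, g_safe = sample_penalty(weights, cluster_class, [0.3, 0.3], true_class=0)
e_edge, g_edge = sample_penalty(weights, cluster_class, [0.5, 0.5], true_class=0)
print(e_safe, e_edge, g_edge)
```

The second sample lies on the boundary between the two clusters, so it is penalized by the margin alone even though it is not misclassified outright.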

After all Nt training samples have been run through the system, we accumulate all of these
errors or energies (all are positive or zero). We also accumulate the gradients
$\nabla_{\mathbf{w}_i}E$. From Eq. (10), by taking the derivative with respect to
$\mathbf{w}_i$, we see that $\nabla_{\mathbf{w}_i}E$ is zero for all i when an input is
classified correctly by more than S; otherwise, it equals either $-\mathbf{x}_a$ (if input
a should be classified into the same class as cluster neuron i) or $+\mathbf{x}_a$ (if a is
incorrectly classified by cluster neuron i). Thus, the sum of all the contributions to
$\nabla_{\mathbf{w}_i}E$ equals the sum of the $\pm\mathbf{x}_a$ for samples erroneously
classified (or correctly classified but with a margin less than S) in layer-two clusters.
We then use $\nabla_{\mathbf{w}_i}E$ to adapt the weights $\mathbf{w}$ by the
conjugate-gradient algorithm. We then repeat presentation of the training set (a new
iteration), calculate the new errors E and gradients $\nabla_{\mathbf{w}_i}E$, and update
the weights accordingly. This procedure repeats until satisfactory performance on the test
set is obtained.

Fig. 3. Input P1 neuron representation space (wedge-sampled Fourier transform): (a)
architecture; (b) P2 sampling.

We considered other LDFs (Ho-Kashyap, Fisher, Fukunaga-Koontz, etc.). However, these LDFs require more calculation than does our current algorithm to update the weights. Thus, for computational reasons, our present choice (perceptron criterion) is preferable.

Fig. 4. Three-class, two-feature artificial database example (benchmark 1).

Fig. 5. Nonlinear decision boundaries produced for the artificial database.

D. Input P1 Neuron Representation Space

In our distortion-invariant multiclass pattern recognition applications, we use a wedge-sampled magnitude Fourier transform feature space, since this feature space can easily be produced optically. Figure 3(a) shows the standard architecture that produces the Fourier transform at P2 of the P1 input 2-D image data. Figure 3(b) shows the standard wedge-ring detector used at P2. The wedge features provide scale invariance and the ring features provide in-plane rotation invariance. Our distortion-invariant data will involve different aspect views of several objects (and not in-plane distortions). Thus, we chose the wedge features (this provides scale invariance, although we do not include scale distortions in our test data). We obtain aspect-view invariance by training on various aspect-distorted object views.


IV.  Test  Results 

We consider two databases: an artificial set of data (to demonstrate the nonlinear surfaces produced using only two features) and a set of three aircraft with various azimuth and elevation (3-D) distortions present. We refer to these as benchmarks 1 and 2.

A.  Benchmark  1  Results  (Artificial  Data) 

An artificial set of 383 samples in three classes (181 in class 1, 97 in class 2 and 105 in class 3) with two features was generated with samples as shown in Fig. 4. This problem definitely requires a nonlinear decision boundary and the results can be shown in the 2-D feature space. This is the purpose of this example, since no separate test data exist. The neural net used contained three input neurons (two for the features plus one for the threshold), six hidden neurons (two per class) and three output neurons (one per class). All $N_T$ samples were used to select the prototypes. The first reduced nearest-neighbor clustering produced thirty-one prototypes (8.1% of the total $N_T$) that gave an error rate $P_e = 0\%$ for all samples. The six prototypes whose removal gave the most error were then selected in stage two.

Fig. 6. Comparative data on speed of convergence for benchmark-1 data.

After  eighty  iterations  of  the  full  training  set,  the 
classification  rate  (defined  as  the  percentage  of  test 
samples  correctly  classified)  was  constant  at  97.1% 
with  our  ACNN  algorithm.  After  300  iterations  the 
BP  classification  rate  was  constant  at  approximately 
the  same  value  (96.3%).  (This  result  is  the  average 
obtained  over  ten  runs  with  different  random  initial 
weight  sets.)  The  final  input  layer  weights  to  the  six 
hidden  layer  neurons  correspond  to  six  straight  lines 
(LDFs)  in  the  feature  space.  For  BP  these  six  lines 
would  define  the  decision  surface.  In  the  ACNN  this 
is  not  the  case  (because  of  the  winner-takes-all  action 
at P2). The decision-surface lines were determined by successively providing all of the possible feature vectors on a grid of $x_1$-$x_2$ values (for both $x_1$ and $x_2$ in the interval [0,1]) to the classifier and for each feature vector determining the class into which it is classified by the neural net. The decision boundaries indicate
where  a  transition  in  classification  occurred.  The 
boundaries  thus  obtained  are  shown  in  Fig.  5.  They 
produce  four  separate  regions  of  feature  space  (two 
correspond  to  the  same  class  and  the  others  correspond 
to  the  other  two  classes). 
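The grid-scanning procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the two-neuron weight set, the grid resolution `steps`, and the helper names `classify` and `boundary_cells` are hypothetical, and the threshold is modeled as an extra input fixed at 1.

```python
def vip(w, x):
    """Vector inner product of a weight vector and an input vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def classify(weights, classes, features):
    """Winner-takes-all: class of the hidden neuron with the largest VIP."""
    x = list(features) + [1.0]  # extra input fixed at 1 carries the threshold
    activities = [vip(w, x) for w in weights]
    return classes[activities.index(max(activities))]


def boundary_cells(weights, classes, steps=50):
    """Grid points whose right or upper neighbor is assigned a different class."""
    grid = [k / steps for k in range(steps + 1)]
    label = [[classify(weights, classes, (x1, x2)) for x2 in grid] for x1 in grid]
    cells = []
    for i in range(steps):
        for j in range(steps):
            if label[i][j] != label[i + 1][j] or label[i][j] != label[i][j + 1]:
                cells.append((grid[i], grid[j]))
    return cells


# Two hidden neurons (illustrative): class 0 wins where x1 >= x2.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
classes = [0, 1]
cells = boundary_cells(weights, classes, steps=10)
print(len(cells), cells[:3])
```

With more hidden neurons per class, the same scan traces the piecewise linear boundaries produced by the winner-takes-all selection.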


From  inspection  of  Fig.  4  one  would  estimate  that  a 
piecewise  linear  decision  surface  with  at  least  five 
straight-line  sections  would  be  needed  to  separate  the 
data  adequately  and  that  about  ten  errors  might  be 
expected. Thus, at least five hidden neurons are expected to be needed. In Fig. 4 we see that, with six hidden neurons, approximately ten classification errors are made, producing the classification rate of 97.1%.

Figure 6 compares the classification rate for the two neural nets and for a multivariate Gaussian classifier. Both neural nets give comparable classification rates (97.1% and 96.3%) after convergence, whereas the Gaussian classifier's performance is worse (89.5%) and by definition does not vary with the number of iterations of the training set. The ACNN learns much faster (convergence in 80 iterations) than BP (approximate convergence in 300 iterations). From Eq. (4) this represents approximately an additional $N_T N I = (383)(6)(220) \approx 505560$ VIP calculations required with BP. The prototype selection steps in our ACNN algorithm required approximately $0.5N_T^2 = (0.5)(383)^2 \approx 73350$ VIPs and thus the total number of calculations and hence training time for our ACNN is considerably less than the learning time for BP. We reran the prototype selection portion of our ACNN algorithm using only $5N = 30$ samples randomly selected from the 383, ensuring that we obtain at least one prototype per class. The decision boundaries produced are shown in Fig. 7. As can be seen, the decision boundaries are virtually identical; the resulting error rates differ by only 0.2% (96.9% classification was obtained after 100 iterations). This was now achieved with only $0.5(30)^2 = 450$ VIPs for prototype selection.

Fig. 7. Nonlinear decision boundaries produced for the artificial database when prototypes are selected from reduced training set.
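The operation counts quoted for benchmark 1 can be checked with a few lines of Python. This is a simple arithmetic sketch; the function names are hypothetical, and the formulas are the ones stated in the text (training-set size times hidden neurons times extra iterations for BP, and half the square of the sample count for prototype selection).

```python
def extra_bp_vips(n_train, n_hidden, extra_iterations):
    """Additional VIPs needed by BP for its extra training iterations."""
    return n_train * n_hidden * extra_iterations


def prototype_selection_vips(n_samples):
    """Approximate VIPs for prototype selection (half of n_samples squared)."""
    return 0.5 * n_samples ** 2


# 383 samples, 6 hidden neurons, 300 - 80 = 220 extra BP iterations.
print(extra_bp_vips(383, 6, 300 - 80))   # additional BP VIPs
print(prototype_selection_vips(383))     # full training set
print(prototype_selection_vips(30))      # reduced set of 5N = 30 samples
```

The full-set prototype selection cost is roughly one seventh of the additional BP cost, and the reduced-set cost is smaller still by more than two orders of magnitude.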

This  data  set,  therefore,  indicates  that  similar  per¬ 
formances  can  be  obtained  with  BP  and  ACNN,  with 
ACNN  training  appreciably  faster  than  BP.  We  have 
also  seen  that  the  time  for  prototype  selection  with 
ACNN  can  be  made  negligible  by  using  a  reduced 
number  of  learning  samples,  without  affecting  per¬ 
formance  adversely. 

B. Benchmark 2 Results (3-D Distorted Aircraft Data)

As our second data set, we used synthetic distorted aircraft imagery and our wedge-sampled Fourier feature space. The imagery consisted of three aircraft (F-4, F-104, and DC-10) binarized to 128 × 128 pixels with each aircraft occupying about the central 100 × 64 pixels. As our training set, we used 630 images of each aircraft (a total of $N_T = 1890$ training set samples). The images were different azimuth views (with the aircraft viewed from different angles left to right) and elevation views (with the aircraft viewed from different angles above or below its center line). The range of azimuth angles used covered −85° to +85° and the elevation angle was varied from 0° to 90° with 5° increments in each angle (the same image results if negative elevation angles are used). The input neuron representation space was a thirty-two element feature space (the thirty-two wedge magnitude Fourier samples). The test set used consisted of 578 orientations of each aircraft not present in the training set (these were views at intermediate angles about 2.5° in each direction from those in the training set). Figure 8 shows three distorted versions of each aircraft. The left image is the top-down view with 0° variation in elevation and azimuth. The central image shows a view from an azimuth angle of 45° to the left. The right image for each object shows an image with elevation angle of 45°.

The  three-layer  ACNN  used  contained  thirty-three 
input  neurons,  nine  hidden  neurons  and  three  output 
neurons  (one  per  class). 

Figure 9 compares the speed (number of iterations of the full training set) and classification performance for the two neural nets and the Gaussian classifier. Both neural nets yield the same classification rate (98.6%) compared to only 89% for the Gaussian classifier. BP converges in 350 iterations and our ACNN in fewer (180) iterations. As with the 2-D data set, a reduced data set for prototype selection can be employed successfully. It was found that with $5N = 45$ samples used for prototype selection, 98.6% classification performance was obtained after 180 iterations. With this reduced number of samples, the time for prototype selection is negligible compared with the time for a single iteration, so that the relative training times are again determined by the number of iterations required for each method. Thus, ACNN requires approximately 50% of the training time of BP.

Fig. 8. Representative images for the three-class 3-D distortion example (benchmark 2).

V. Optical and Optical/Electronic Realization

Many choices are possible for the role of optics in the learning and classification stages of our ACNN. These are now discussed. The feature space (wedge-sampled magnitude Fourier transform) should be optically calculated (even in learning) since this feature space is easily produced optically and since we will use the optically produced feature space in our on-line classification. The two steps of prototype selection are best performed electronically, since they are off-line operations and require manipulation of stored data and control operations most compatible with digital electronics. The distance calculations required in the nearest-neighbor calculations can be performed on an optical VIP architecture (we now discuss this and the use of optics in the learning stage).

Once the initial P1-P2 weights have been chosen, the learning stage can be implemented in optics or electronics. Figure 10 shows one such architecture. The input sample $\mathbf{x}_a$ is entered at P1 (on LEDs, laser diodes or a 1-D spatial light modulator (SLM)). It is imaged onto the initial set of $N$ weight vectors (for the $N$ prototype hidden layer neurons) which are arranged on rows at Pa (with the first two to five rows corresponding to the prototypes for class 1, the next two to five rows being the prototypes for class 2, etc.). Thus, the rows at Pa are the initial weights as given in Eq. (7). The VIPs of $\mathbf{x}_a$ and all of the $\mathbf{w}_i$ weight vectors at Pa are formed on a linear detector array at P2. The Pa rows and P2 elements are separated into $C$ groups (the $C$ classes). The maximum VIP element in each class is determined (simple comparator logic is sufficient since the number of prototypes per class is small). This provides us with $\mathbf{w}_{i(c)}^T\mathbf{x}_a$ and $\mathbf{w}_{i(\bar{c})}^T\mathbf{x}_a$ in Eq. (10). Bipolar values for $\mathbf{w}_i$ should be handled by spatial multiplexing at Pa and subtraction of adjacent P2 outputs. Alternatively, the Pa data can be placed on a bias (but this increases dynamic-range requirements). The weights must be updated after each iteration of the training set. If Pa is a microchannel spatial light modulator (or similar device) that can record positive and negative data (with a bias on the device), we can update the weights by adding and/or subtracting the appropriate values for each weight. These updates to the weights at Pa are various combinations of the training vectors $\mathbf{x}_a$. These could be calculated in electronics, entered sequentially at P1 and (with a mechanism to activate only selected rows at Pa) we could update Pa as required. Alternatively, we could repeat each $\mathbf{x}_a$ at P1 and vary the input illumination and the Pa row accessed and hence control the amount of each $\mathbf{x}_a$ added to or subtracted from each weight vector at Pa. The digital control required, the complexity of the system (a modulated light source to control the amount of each $\mathbf{x}_a$ used, access to only one row of Pa at a time), the need for $N$ accesses of Pa for each of the $N_T$ vectors $\mathbf{x}_a$, and the Pa SLM requirements make the electronic calculation of the updated weight vectors and the electronic off-line implementation of the learning stage preferable (at present). As Pa SLM technology matures, it would probably be realistic to calculate all VIPs optically, determine the new weights electronically, and reload these directly into Pa after each iteration of the training set. However, at present, we assume that all learning is electronic (since it is off-line).
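The handling of bipolar weights by spatial multiplexing and subtraction of adjacent outputs can be mimicked numerically. The sketch below assumes one common arrangement, in which each weight vector is split into two nonnegative rows (its positive part and the magnitude of its negative part) whose two detector outputs are subtracted. The function names and values are hypothetical and only illustrate the idea.

```python
def split_bipolar(w):
    """Split a bipolar weight vector into two nonnegative rows."""
    positive = [max(v, 0.0) for v in w]
    negative = [max(-v, 0.0) for v in w]
    return positive, negative


def multiplexed_vip(w, x):
    """VIP formed from two nonnegative rows and one subtraction."""
    positive, negative = split_bipolar(w)
    out_pos = sum(p * xi for p, xi in zip(positive, x))  # first detector output
    out_neg = sum(n * xi for n, xi in zip(negative, x))  # adjacent detector output
    return out_pos - out_neg


w = [0.5, -0.25, 1.0]   # bipolar weights (illustrative values)
x = [1.0, 2.0, 3.0]     # nonnegative input features
direct = sum(wi * xi for wi, xi in zip(w, x))
print(multiplexed_vip(w, x), direct)
```

The price of this scheme is that twice as many rows and detector elements are needed, whereas a bias avoids the extra rows but enlarges the dynamic range that must be handled.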

Once learning has been completed, the P1-P2 weights are fixed and the input-to-hidden layer neurons and weights (the P1-P2 neuron system) can be implemented on an optical VIP system (such as P1-P2 of Fig. 10) with a fixed mask at Pa. The number of P1 neurons is modest (the input neuron representation is a compact feature space), and the number of P2 neurons is also small (typically less than five times the number of classes). Our ACNN requires a winner-takes-all (WTA) maximum selection of the most active P2 neuron. This can be implemented with a WTA neural network or with standard comparison techniques. Since the number of P2 neurons ($N$) is small, standard electronic WTA techniques are preferable (we quantify this below). Since the P2-P3 hidden-to-output neuron weights are fixed and are all unity or zero, the P2-P3 weights simply perform a mapping and can easily be implemented in electronics. Thus, we implement the input-to-hidden layer neuron weights and calculations optically and the hidden layer neuron maximum selection (WTA) and the hidden-to-output neuron mapping in electronics. Figure 11 summarizes the learning and classification stages in block diagram form with attention to which operations are performed in optics and which in electronics.

Fig. 11. Block diagram for adaptive-clustering neural net using (a) electronics for learning (training) and (b) optics for classification (on-line).

The  two  WTA  electronic  techniques  possible  (in 
classification)  are  to  use  an  operational  amplifier  peak 
detector  to  scan  all  N  outputs  at  P2  or  to  employ  a 
parallel  digital  technique.  In  the  digital  technique, 
the $N$ outputs are A/D converted, each pair of P2 outputs (1 and 2, 3 and 4, etc.) is compared and the maximum of each pair is obtained. Pairwise comparisons of the resulting $N/2$ outputs are then performed and the procedure is continued for $\log_2 N$ levels until
the  maximum  is  obtained.  For  100  input  and  hidden 
neurons,  one  matrix-vector  multiplication  (required 
to  update  the  P2  neuron  activities)  requires  about 
10,000  additions  and  10,000  multiplications;  whereas, 
maximum  selection  requires  only  about  100  compari¬ 
sons.  Thus,  the  maximum  selection  is  typically  negli¬ 
gible  computationally  compared  with  the  neuron  up¬ 
date  stage,  and  can  be  implemented  in  serial  electronic 
hardware  without  sacrificing  the  speed  of  the  system. 
We,  thus,  implement  the  WTA  operation  in  electronics 
using  comparators  rather  than  with  a  neural  net.  The 
specific  electronic  WTA  technique  chosen  depends  on 
the  accuracy  and  speed  required.  Since  these  opera¬ 
tions  are  required  once  for  each  test  input  in  classifica¬ 
tion,  the  WTA  time  required  is  set  by  the  rate  at  which 
new  input  image  data  occurs  and  the  rate  at  which  its 
features  can  be  calculated. 
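The parallel digital technique amounts to a pairwise tournament. The following Python sketch (a software analogy with hypothetical names, not a description of the actual hardware) counts the levels and comparisons needed, assuming a nonempty list of outputs and letting an unpaired output advance to the next level unchanged.

```python
def tournament_max(outputs):
    """Return (index of the maximum, number of levels, number of comparisons)."""
    contenders = list(range(len(outputs)))
    levels = 0
    comparisons = 0
    while len(contenders) > 1:
        winners = []
        # Compare adjacent pairs (1 and 2, 3 and 4, etc.).
        for k in range(0, len(contenders) - 1, 2):
            a, b = contenders[k], contenders[k + 1]
            winners.append(a if outputs[a] >= outputs[b] else b)
            comparisons += 1
        if len(contenders) % 2 == 1:
            winners.append(contenders[-1])  # unpaired output advances
        contenders = winners
        levels += 1
    return contenders[0], levels, comparisons


activities = [0.1, 0.7, 0.3, 0.9, 0.2, 0.4, 0.8, 0.5]
print(tournament_max(activities))
```

For N outputs the tournament always uses N - 1 comparisons, which for 100 hidden neurons is the "about 100 comparisons" quoted above, set against roughly 10,000 multiplications for the matrix-vector product.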

VI.  Summary,  Conclusions  and  Discussion 

A new three-layer adaptive-clustering neural net (ACNN) has been described. It provides a new procedure to select the number of hidden layer neurons (we use several neurons per class, each being a prototype or cluster representative of a particular class) and provides initial (non-random) input-to-hidden layer neuron weights. These initial weights are selected using standard pattern recognition clustering techniques. They are then updated during learning using a new neural net adaptive supervised learning algorithm. This results in a new neural net that combines standard pattern recognition and neural net techniques to produce piecewise linear decision surfaces from the linear discriminant functions. The input neurons are analog and of low dimensionality (a feature space with inherent distortion invariances). Quantitative data show that the learning time and number of calculations required in our new ACNN are significantly lower (by a factor of 2 to 4) than for the more well-studied BP neural net. We also found that the use of a conjugate-gradient (rather than gradient descent) update algorithm significantly speeds up BP.

BP and the ACNN will usually not result in similar weights since BP uses neurons for other operations besides clustering, because BP has no WTA competition in its hidden layer as in the ACNN and because the hidden-to-output weights are different in BP and only perform mapping in the ACNN. However, the decision boundaries that result are usually very similar (with the ACNN decision boundaries generally being a piecewise linear approximation to the more curved ones in BP). Thus, the two classifiers employ different means to similar ends, with the ACNN providing faster training without the need to select many empirical parameters. Since only one hidden neuron in the ACNN is dominant, piecewise linear surfaces result and more hidden neurons may be needed. Our intent is not to compare BP and our ACNN; rather, we note the attractive properties of our new neural net. Besides providing a new way to select the hidden neurons, our neural net algorithm has only one ad hoc parameter to be empirically selected (the number of hidden neurons). Changes in ACNN weights during training provide information on the data that can be of use in better understanding results and in extending results to other cases (other neural nets do not have this property). For example, in sequential gradient descent updating algorithms (the delta rule) different results occur depending on the order in which the training data are presented and depending on the random initial weights (by comparison, the ACNN provides consistent results).

CHAPTER  3


Ho-Kashyap  optical  associative  processors 


Brian  Telfer  and  David  P.  Casasent 


A  Ho-Kashyap  (H-K)  associative  processor  (AP)  is  shown  to  have  a  larger  storage  capacity  than  the 
pseudoinverse  and  correlation  APs  and  to  accurately  store  linearly  dependent  key  vectors.  Prior  APs  have 
not  demonstrated  good  performance  on  linearly  dependent  key  vectors.  The  AP  is  attractive  for  optical 
implementation.  A  new  robust  H-K  AP  is  proposed  to  improve  noise  performance.  These  results  are 
demonstrated  both  theoretically  and  by  Monte  Carlo  simulation.  The  H-K  AP  is  also  shown  to  outperform 
the  pseudoinverse  AP  in  an  aircraft  recognition  case  study.  A  technique  is  developed  to  indicate  the  least 
reliable  output  vector  elements  and  a  new  AP  error  correcting  synthesis  technique  is  advanced. 


I.  Introduction 

The storage capacity, noise performance and key vector requirements of associative processors
(APs)  are  of  major  concern.  This  paper  addresses 
these  issues  using  a  new  AP.  It  is  important  to  distin¬ 
guish  between  general  memory  and  pattern  recogni¬ 
tion  applications.  In  a  general  memory  application, 
APs  store  arbitrary  data  and  it  is  fair  to  assume  that 
the  keys  (input  vectors)  and  recollections  (output  vec¬ 
tors)  are  drawn  from  random  distributions  (these  APs 
are  tested  with  Monte  Carlo  methods).  We  define  the 
storage  capacity  of  an  AP  to  be  the  number  of  key/ 
recollection  vector  pairs  that  can  be  nearly  perfectly 
(99-100%)  stored  in  a  general  memory  application.  In 
pattern  recognition  problems,  an  AP  has  many  key 
vectors  (e.g.,  distorted  inputs)  associated  with  the 
same  recollection  vector  (a  class  label)  and  must  gener¬ 
ally  operate  on  shifted  and  distorted  input  patterns. 
A  large  number  of  keys  are  stored  to  represent  the 
distortions  of  the  different  classes.  In  pattern  recog¬ 
nition,  recall  accuracy  for  a  specific  use  is  more  impor¬ 
tant  than  storage  capacity. 

We consider APs with bipolar binary recollection vectors. This case commonly occurs in pattern recognition, where the recollection vectors are class labels. Our key vectors have analog values taken from arbitrary data (for the general memory) or a feature space (for the pattern recognition application). We consider heteroassociative processors (HAPs), in which the keys and recollections differ, rather than autoassociative processors (AAPs), in which the keys are identical to their recollections.

One popular AP, the Hopfield memory, has been shown empirically to have a capacity of $M \approx 0.15N$, where $M$ is the number of keys and $N$ is their dimension. Theoretically, an asymptotic ($N \to \infty$) capacity of $M = N/(4\log_2 N)$ has been shown for the Hopfield memory. Because of its very low capacity, we do not further consider this or similar correlation APs (where the memory matrix is calculated by summing the vector outer product of each key and its recollection). The pseudoinverse AP (where the memory matrix is calculated from the pseudoinverse of the key matrix) is preferable because it perfectly stores key/recollection vector pairs as long as the keys are linearly independent. This allows up to $M = N$ vector pairs to be perfectly stored. The pseudoinverse AP also has good recall accuracy when $M > N$. In this paper, we discuss how an AP computed by the Ho-Kashyap (H-K) algorithm has better recall accuracy and a larger capacity than the pseudoinverse memory. For general memory applications, we show that the maximum storage capacity of the H-K AP is $M = 2N$, and that it can perfectly store keys that are linearly dependent. We also modify the H-K AP to improve its noise performance. A modified version of the algorithm (an error correcting H-K AP algorithm) that allows a low accuracy processor to be used is also advanced. This is of particular concern when an optical processor is employed.

Other AP work used the H-K algorithm for computing APs, but limited the number of key vectors to be substantially less than their dimensionality. [An overdetermined problem was created by adding key/recollection vector constraints to map unit vectors (with a single 1) to all-zero vectors, in addition to storing the key vectors.] A major emphasis of this paper is that the advantage of the H-K AP over the pseudoinverse AP occurs when the number of key vectors exceeds their dimensionality and that the H-K AP can handle linearly dependent key vectors (i.e., $M > N$). Other H-K AP work also used output thresholding and feedback in recall mode. We do not consider this for high capacity APs ($M > N$). To use feedback in HAP recall requires a bidirectional associative memory (BAM). To use the H-K algorithm to synthesize a BAM requires that two separate forward (key → recollection) and reverse (recollection → key) mappings be computed. This significantly increases the complexity of the processor and the amount of storage required, and hence makes the bidirectional processor less desirable for optical implementation. Also, the BAM capacity is limited by the minimum of $N$ and $K$, where $N$ and $K$ are the key and recollection vector dimensions. When $K$ is small (which is to be expected in an HAP when the recollections are used for decisions or class labelling), the capacity of a BAM constructed using the H-K algorithm will be less than that of the unidirectional processor we consider.

Another approach to associative storage is the direct storage nearest-neighbor (DSNN) AP. For bipolar binary keys, the memory matrix simply contains the key vectors as its rows. In recall, the output vector resulting from multiplying the memory matrix by the input vector has elements that are the vector inner products of the input with each key. The largest output element indicates which key has the smallest Hamming distance to the input. The corresponding recollection vector can then be selected as the final output. The Hamming Net operates on the same principles. The DSNN AP can also be extended to analog keys. The AP then finds the key with the smallest Euclidean distance to the input. The DSNN AP has several attractive properties. It is trivial to synthesize and to update, and it is guaranteed to output the recollection whose key is closest to the input.
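A minimal Python sketch of DSNN recall for bipolar binary keys is given below. The stored keys, recollections and function names are hypothetical and chosen only for illustration. For bipolar vectors of dimension N, the inner product with a key equals N minus twice the Hamming distance, which is why the largest output marks the nearest key.

```python
def dsnn_recall(keys, recollections, x):
    """Return the recollection of the key with the largest inner product with x."""
    outputs = [sum(k_i * x_i for k_i, x_i in zip(key, x)) for key in keys]
    winner = outputs.index(max(outputs))
    return recollections[winner], outputs


def hamming(a, b):
    """Number of positions in which two bipolar vectors differ."""
    return sum(1 for a_i, b_i in zip(a, b) if a_i != b_i)


keys = [
    [1, 1, 1, 1, -1, -1],
    [1, -1, 1, -1, 1, -1],
    [-1, -1, 1, 1, 1, 1],
]
recollections = [[1, -1], [-1, 1], [-1, -1]]

x = [1, -1, 1, -1, 1, 1]  # second key with its last element flipped
recalled, outputs = dsnn_recall(keys, recollections, x)
print(recalled, outputs, [hamming(k, x) for k in keys])
```

The memory matrix here is just the list of keys, so adding or removing a stored pair requires no recomputation.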

The other APs that we have mentioned (correlation, pseudoinverse, H-K) are more difficult to update (although the correlation AP is still relatively simple to update). They are also not guaranteed to output the recollection whose key is closest to the input, although they do so for low input noise levels. We believe that the main advantage of these three APs over the DSNN AP is that their memory matrices can be smaller. The DSNN AP memory matrix has $MN$ elements, while the other AP memory matrices each have $KN$ elements, where $K$ is the recollection vector dimension. Thus, the other APs have fewer matrix elements than the DSNN AP when $K < M$. This condition is true when the recollections are class labels and have a low dimension. These low dimensional labels from an AP can be used to read out high dimensional recollection vectors from an addressable memory. In addition, the H-K AP can store $M > N$ key vectors, and in this case $K < M$ even if $K = N$. For optical implementations, where space bandwidth product is a major concern, the difference in memory matrix size is important. The longer updating times for the pseudoinverse and H-K APs are not a major concern for applications utilizing gated learning, where most time is spent in recall mode, and learning is only initiated after a significant event has occurred.

In  Sec.  II,  we  review  the  pseudoinverse  AP  and  es¬ 
tablish  our  notation.  Section  III  advances  our  H-K 
algorithms.  Optical  implementation  of  these  proces¬ 
sors  is  considered  in  Sec.  IV.  Section  V  gives  theoreti¬ 
cal  and  simulation  results  for  the  general  memory  ap¬ 
plication.  A  case  study  of  distortion  invariant  aircraft 
recognition  is  presented  in  Sec.  VI.  In  Sec.  VII,  we 
offer  a  summary  and  conclusion. 

II.  Pseudoinverse  Associative  Processor  Formulation 

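Denoting the keys and recollections as the vectors $\mathbf{x}_k$ ($N$-dimensional) and $\mathbf{y}_k$ ($K$-dimensional), respectively, where $k = 1, \ldots, M$, the vectors $\mathbf{x}_k$ and $\mathbf{y}_k$ form an associated key/recollection pair (there are $M$ such pairs). We desire a $K \times N$ matrix $\mathbf{M}$ satisfying

$$\mathbf{y}_k = \operatorname{sgn}(\mathbf{M}\mathbf{x}_k), \qquad (1)$$

for $k = 1, \ldots, M$, where $\operatorname{sgn}(\mathbf{M}\mathbf{x}_k)$ indicates that a signum function is applied to each vector element ($\operatorname{sgn}(x) = 1$ if $x > 0$ and $\operatorname{sgn}(x) = -1$ otherwise). Defining matrices $\mathbf{X}$ ($N \times M$) and $\mathbf{Y}$ ($K \times M$) with the key and recollection vectors as their columns, Eq. (1) can be rewritten as

$$\mathbf{Y} = \operatorname{sgn}(\mathbf{M}\mathbf{X}). \qquad (2)$$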

It is useful to distinguish between autoassociative processors
(AAPs), in which Y = X, and heteroassociative
processors (HAPs). Autoassociative processors are
used for restoring partial or noisy inputs. Our major
concern is HAPs since they are useful for decisions and
pattern recognition. It is well known that the pseudoinverse
AAP degenerates to the identity matrix
when the rows of X are linearly independent, which is
likely to occur when M > N. Although this is clearly
not a useful processor, it does correctly recall exact key
inputs, and the H-K algorithm cannot improve on it.
This is another reason why this paper considers only
HAPs.
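
The degeneration of the pseudoinverse AAP can be checked numerically. The sketch below assumes NumPy, random keys, and the pseudoinverse memory $\mathbf{M} = \mathbf{Y}\mathbf{X}^{+}$ (the solution introduced next as Eq. (3)) with Y = X; the sizes N = 5 and M = 12 are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 12                           # key dimension and number of keys (M > N)
X = rng.uniform(-1.0, 1.0, (N, M))     # keys as columns; rows are independent here

# Autoassociative pseudoinverse memory: the recollections equal the keys (Y = X).
M_aap = X @ np.linalg.pinv(X)          # N x N memory matrix

# With linearly independent rows of X, the memory is the identity matrix.
is_identity = np.allclose(M_aap, np.eye(N))
print(is_identity)
```

Such a memory reproduces any input unchanged, so it recalls exact keys correctly but cannot restore a noisy or partial one.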

A solution of Eq. (2) is

$$\mathbf{M} = \mathbf{Y}\mathbf{X}^{+}, \tag{3}$$

where $\mathbf{X}^{+}$ is the pseudoinverse of X. If the key vectors
are linearly independent, then Eq. (3) is guaranteed to
satisfy Eq. (2) exactly. We find linearly independent
key vector requirements to be unrealistic. When the
key vectors are linearly dependent, the solution in Eq.
(3) is approximate. This is guaranteed to be the case
when M > N. Such an approximate solution is useful
and allows a larger storage capacity (M > N). Only
limited attention has been given to such cases. We
will refer to Eq. (3) as an exact pseudoinverse AP
(when M < N and the keys are linearly independent)
and as an approximate pseudoinverse AP (when M > N).

To compute the pseudoinverse, we use the singular
value decomposition (SVD) approach because it can
be used for either linearly independent or dependent
keys and because it allows us to improve the AP's noise
performance by a method that will now be explained.
The conventional pseudoinverse HAP recalls noisy
key vectors poorly when M ≈ N (with much better
performance occurring when M ≪ N and M ≫ N). A
recent paper by Murakami and Aibara explains this
phenomenon and shows that to optimize recall accuracy
for a particular noise variance $\sigma^2$, all singular values
$\mu_i$ satisfying

$$\mu_i < \sqrt{M}\,\sigma \tag{4}$$

should be set to zero. Then the memory matrix is
computed by

$$\mathbf{M} = \mathbf{Y}\tilde{\mathbf{X}}^{+}, \tag{5}$$

where $\tilde{\mathbf{X}}$ is the key matrix X with small singular values
set to zero. For realistic σ values, this method causes
only a small decrease in the recall accuracy for exact
key vector inputs when M ≈ N, and significantly improves
recall accuracy when noise is present. The
method is attractive since it does not alter the pseudoinverse
HAP's performance for M ≪ N and M ≫ N.
We use this approach in our simulations described in
Sec. V, which confirm the above statements. We note
that it is well known that very small singular values
should always be set to zero to avoid problems of
numerical instability. The method explained
above differs from this, in that the threshold for zeroing
the singular values is given as a function of M and σ
(as opposed to selecting it arbitrarily) and that the
singular values set to zero can exceed such numerical
tolerances by orders of magnitude (e.g., if M = 50 and
σ = 0.1, the threshold for $\mu_i$ is 0.71).

III. Ho-Kashyap Associative Processors

The H-K AP has a larger storage capacity than the
pseudoinverse AP because it requires that the key
vectors be only linearly separable for perfect recall,
rather than linearly independent, as the pseudoinverse
AP requires. Since linear separability is a looser restriction
than linear independence, the H-K AP in
many cases can perfectly store linearly dependent
keys. Before presenting the H-K algorithm, we first
describe how linear separability applies to APs.

A. Linear Separability and the H-K Algorithm

Recall that the columns of Y are the recollection
vectors for the different key vectors. Hence, row i of Y
gives the desired values of the ith output element for
the different key vectors. Each row of the AP matrix,
with its threshold value, forms a linear discriminant
function (LDF) that separates the N-dimensional input
space with a hyperplane into two classes, those key
vectors for which element i of the recollection vector is
−1 and those for which it is +1. The locations of the
±1 elements in row i of Y denote these two classes for
that row. If these two classes can be separated with a
hyperplane, then they are linearly separable and there
exists an LDF that will give perfect recall for that
output recollection element. The pseudoinverse AP
minimizes the squared error, but is not guaranteed to
give perfect recall even if the K class groupings (one for
each output recollection vector element) are all linearly
separable. The Ho-Kashyap algorithm iteratively
computes an LDF (i.e., one row of M) that will correctly
classify two classes if they are linearly separable.
If they are not linearly separable, the algorithm will
still converge to a minimum squared error solution,
and will indicate that the classes are not linearly separable
and which output vector elements may not be
correct.


B.  Ho-Kashyap  APs 

The H-K algorithm in a new matrix version for AP
synthesis is noted in Table I. We begin with an estimate
of M from the pseudoinverse (step 1). The pseudoinverse
memory is only an estimate because it is only
an approximate solution for M > N. We modify Y
(step 4) and M (step 1) in successive iterations. If the
pseudoinverse is exact (i.e., the keys are linearly independent)
then no modifications will be made. The H-K
algorithm improves the pseudoinverse memory
when the keys are linearly dependent. In step 2, we
calculate the error matrix E, which gives the errors
between the actual and desired outputs. The matrix S
in step 3 contains the signs of the Y elements, ⊙ denotes
Hadamard (pointwise) multiplication, and the
subscript n is the iteration index. In step 3, we use S to
form a modified error matrix E'. This matrix equals E
except that all E elements that differ in sign from the
corresponding elements in Y are set to 0. This ensures
that none of the Y elements change sign (we assume
initial bipolar binary Y values) when E' is added to Y in
step 4 to produce an updated Y. The elements of Y cannot
be allowed to change sign because the signs determine
on and off recollection vector elements. Step 5 then
returns the algorithm to step 1 where M is updated.
Once $\mathbf{E}'_n = 0$, the algorithm has converged (convergence
is guaranteed whether the keys are linearly separable
or not). If a row of $\mathbf{E}_n$ equals 0 then that row's
dichotomy (grouping into two classes) is linearly separable;
otherwise it is not. In actual application, the
algorithm can also be stopped if M correctly recalls all
of the key vectors.
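
The five steps of Table I can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `ho_kashyap_ap`, the tolerance `tol`, and the small random test problem (built with a margin so that it is linearly separable by construction) are assumptions introduced here.

```python
import numpy as np

def sgn(a):
    """Signum of Eq. (1): +1 where a > 0, otherwise -1."""
    return np.where(a > 0, 1.0, -1.0)

def ho_kashyap_ap(X, Y, rho=0.5, max_iter=1000, tol=1e-9):
    """Matrix H-K synthesis following steps 1 to 5 of Table I (max_iter >= 1)."""
    S = sgn(Y)                      # signs of the bipolar recollections
    Y_n = Y.astype(float)           # working copy of Y; magnitudes may grow
    X_pinv = np.linalg.pinv(X)      # X^+ is computed once
    for n in range(1, max_iter + 1):
        M_n = Y_n @ X_pinv                       # step 1: memory estimate
        E = M_n @ X - Y_n                        # step 2: error matrix
        A = S * E
        E_prime = S * 0.5 * (A + np.abs(A))      # step 3: keep same-sign errors
        perfect = np.array_equal(sgn(M_n @ X), S)
        if perfect or np.all(np.abs(E_prime) <= tol):   # step 5: stop tests
            break
        Y_n = Y_n + 2.0 * rho * E_prime          # step 4: update Y
    return M_n, E, n

# Hypothetical test problem: labels come from hidden hyperplanes W, and only
# keys at least 0.3 away from every hyperplane are kept.
rng = np.random.default_rng(0)
N, K, M_keys = 5, 2, 20
W = rng.standard_normal((K, N))
cand = rng.uniform(-1.0, 1.0, (N, 400))
cand = cand[:, np.all(np.abs(W @ cand) > 0.3, axis=0)]
X = cand[:, :M_keys]                 # M_keys > N, so the keys are dependent
Y = sgn(W @ X)

M_hk, E_final, n_iter = ho_kashyap_ap(X, Y, max_iter=20000)
recall_ok = np.array_equal(sgn(M_hk @ X), Y)
print(recall_ok, n_iter)
```

Stopping on perfect recall as well as on E' = 0 mirrors the remark above that the algorithm can be halted once M correctly recalls all of the key vectors.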

Table I. Ho-Kashyap AP Algorithm

Step   Operation
1      $\mathbf{M}_n = \mathbf{Y}_n\mathbf{X}^{+}$
2      $\mathbf{E}_n = \mathbf{M}_n\mathbf{X} - \mathbf{Y}_n$
3      $\mathbf{E}'_n = \mathbf{S} \odot \tfrac{1}{2}\left[(\mathbf{S} \odot \mathbf{E}_n) + |\mathbf{S} \odot \mathbf{E}_n|\right]$
4      $\mathbf{Y}_{n+1} = \mathbf{Y}_n + 2\rho\,\mathbf{E}'_n, \quad 0 < \rho < 1$
5      If $\mathbf{E}'_n \neq 0$ go to 1.

C. Most Reliable Recollection Vector Elements

Since the final E indicates which output elements
give perfect recall for the key vector inputs, the H-K
algorithm automatically provides information about
which output elements are the most reliable. If the
data are linearly separable, E and E' will be all zero
(and the new Y and M achieve this linear separation).
If E' is all zero and E is not, at least one row of Y is not
linearly separable (but the resultant M is better than
the approximate pseudoinverse solution, in that it has
a lower squared error). The rows of E and E' with all
zero elements denote the output elements that are
reliable. This allows us to consider the reliable outputs
first and then the other elements with a reduced
confidence algorithm.
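
A small worked case may help show how the rows of the final E flag reliable output elements. The sketch below is an illustration under assumptions made here: four 2-D keys augmented with a constant 1, one XOR-like row of Y (not linearly separable) and one row that copies the sign of the first key element (linearly separable), with the steps of Table I run inline for ρ = 0.5.

```python
import numpy as np

def sgn(a):
    return np.where(a > 0, 1.0, -1.0)

# Four augmented keys as columns (the third element is the constant 1).
X = np.array([[-1., -1.,  1.,  1.],
              [-1.,  1., -1.,  1.],
              [ 1.,  1.,  1.,  1.]])
# Row 0: XOR-like dichotomy, not linearly separable.
# Row 1: sign of the first key element, linearly separable.
Y = np.array([[-1.,  1.,  1., -1.],
              [-1., -1.,  1.,  1.]])

S, Y_n, X_pinv = sgn(Y), Y.copy(), np.linalg.pinv(X)
for _ in range(100):
    M_n = Y_n @ X_pinv                        # step 1
    E = M_n @ X - Y_n                         # step 2
    A = S * E
    E_prime = S * 0.5 * (A + np.abs(A))       # step 3
    if np.all(np.abs(E_prime) <= 1e-9):       # step 5: E' = 0, stop
        break
    Y_n = Y_n + E_prime                       # step 4 with rho = 0.5

# Rows of E that are (numerically) all zero mark reliable output elements.
reliable_rows = np.all(np.abs(E) <= 1e-9, axis=1)
print(reliable_rows)
```

Here E' vanishes while the first row of E does not, which is the signature of a dichotomy that is not linearly separable; only the second output element would be trusted.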

D.  Robust  Ho-Kashyap  AP 

Our basic H-K AP (Table I) uses the exact or approximate
pseudoinverse AP as an initial solution and
then refines it. Since the pseudoinverse is used in the
basic Ho-Kashyap algorithm, the resulting memory
will suffer the same recall deficiencies as the pseudoinverse
memory when M ≈ N. We therefore propose
that $\tilde{\mathbf{X}}^{+}$ in Eq. (5) be used instead of $\mathbf{X}^{+}$ in Table I. We
call this the robust H-K AP. This combination of the
H-K algorithm and setting small singular values to zero is
quite new.

Since $\tilde{\mathbf{X}}^{+}$ does not equal $\mathbf{X}^{+}$ in all cases, the robust H-K AP
is not guaranteed to find a linearly separable solution if one
exists. However, this is not a major problem. We
have $\tilde{\mathbf{X}}^{+} \neq \mathbf{X}^{+}$ only when M ≈ N, when the keys are
likely to be easy to linearly separate. Thus, we are still
likely to be able to find the solution, even though $\tilde{\mathbf{X}}^{+}$
differs slightly from $\mathbf{X}^{+}$. As M grows larger, the keys
tend to become harder to linearly separate, but $\tilde{\mathbf{X}}^{+}$
tends to become identical to $\mathbf{X}^{+}$, which guarantees that
a linearly separable solution will be found if one exists.
These comments are confirmed by the simulations of
Sec. V.B. Since $\tilde{\mathbf{X}}^{+} \neq \mathbf{X}^{+}$ in some cases, we must also
find the conditions under which the robust algorithm
is guaranteed to converge. We have shown (Appendix
A) that its convergence conditions are identical to
those for the original algorithm, that is, 0 < ρ < 1 in
step 4 in Table I.
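
The thresholded pseudoinverse of Eqs. (4) and (5), which the robust H-K AP substitutes into step 1 of Table I, might be computed as in the sketch below. The helper name `robust_pinv`, the random test matrices, and the use of NumPy's SVD are assumptions for illustration only.

```python
import numpy as np

def robust_pinv(X, sigma):
    """Pseudoinverse of X with singular values below sqrt(M)*sigma set to zero.

    Returns the M x N pseudoinverse and the number of singular values kept.
    """
    M = X.shape[1]                           # number of key vectors (columns)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s >= np.sqrt(M) * sigma           # Eq. (4) removes the others
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ (s_inv[:, None] * U.T), int(keep.sum())

rng = np.random.default_rng(1)
N = M = 50                                   # the M = N case discussed above
X = rng.uniform(-1.0, 1.0, (N, M))
Y = np.where(rng.random((N, M)) < 0.5, -1.0, 1.0)

X_pinv_robust, rank_kept = robust_pinv(X, sigma=0.1)
memory = Y @ X_pinv_robust                   # Eq. (5)
threshold = np.sqrt(M) * 0.1                 # about 0.71 for M = 50
print(round(threshold, 2), rank_kept)
```

Passing the result in place of `np.linalg.pinv(X)` in an H-K loop would give a robust variant; with `sigma=0` no singular value is removed and the ordinary pseudoinverse is recovered.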

E.  Error  Correcting  Ho-Kashyap  AP  Algorithm 

We now mention the use of an error correcting H-K
algorithm that can be used to produce a new error
correcting H-K AP algorithm, which does not require
an initial $\mathbf{X}^{+}$. Because of its error correcting nature
and the fact that SVD is not used, we expect it to
tolerate lower accuracy than the first H-K algorithm
(Table I). Hence it appears attractive for optical implementation.
The algorithm updates Y and M using

$$\mathbf{Y}_{n+1} = \mathbf{Y}_n + \mathbf{E}'_n, \qquad \mathbf{M}_{n+1} = \mathbf{M}_n + \rho\,(\mathbf{S} \odot |\mathbf{E}_n|)\,\mathbf{X}^{T}\mathbf{R}, \tag{6}$$

where R can be any positive definite N × N matrix.
The simplest choice is R = I.

IV.  Optical  Implementation 

The recall operation of the pseudoinverse and H-K
APs can be performed by the standard optical analog
matrix-vector multiplier shown in Fig. 1. The optical
system is attractive for its high speed parallel computing
power. The system operates as follows. The P1
input plane contains N point modulators with light
outputs proportional to x. Each element of x uniformly
illuminates one column of the memory M, a
transmittance array of K × N elements at P2, and the
light leaving P2 is integrated horizontally onto K detectors
at P3. The detector output is the matrix-vector
product Mx. Passing this through a signum function
gives the desired final output y = sgn(Mx). The
matrix M will be bipolar. We note that a variety of
techniques have been developed for optically representing
bipolar data.

Fig. 1. Analog optical matrix-vector multiplier for associative
processor recall.
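
Because light intensities are nonnegative, a bipolar product Mx has to be assembled from nonnegative parts. The sketch below shows one common encoding, splitting each bipolar quantity into a positive and a negative part and subtracting the detected sums electronically. This particular scheme and the variable names are assumptions chosen for illustration; the text does not say which of the cited techniques the authors use.

```python
import numpy as np

rng = np.random.default_rng(2)
K, N = 3, 6
M_mem = rng.uniform(-1.0, 1.0, (K, N))      # bipolar memory matrix
x = rng.uniform(-1.0, 1.0, N)               # bipolar input key

# Split each bipolar quantity into two nonnegative (intensity-like) parts.
M_pos, M_neg = np.maximum(M_mem, 0.0), np.maximum(-M_mem, 0.0)
x_pos, x_neg = np.maximum(x, 0.0), np.maximum(-x, 0.0)

# Four nonnegative matrix-vector products, grouped by the sign of their result.
y_plus = M_pos @ x_pos + M_neg @ x_neg
y_minus = M_pos @ x_neg + M_neg @ x_pos

# Subtraction and the signum function are applied after detection.
y = np.where(y_plus - y_minus > 0, 1.0, -1.0)
print(y)
```

The difference `y_plus - y_minus` equals the bipolar product `M_mem @ x`, so the thresholded output matches sgn(Mx).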

V.  General  Memory  Ho-Kashyap  Associative 
Processors 

We  first  review  theoretical  work  and  then  report  our 
simulation  results. 


A.  Theoretical  Results 

Classic theoretical results allow us to estimate the
storage capacity of the H-K AP. The results assume
that the M N-dimensional key vectors are in general
position. For a group of vectors to be in general position,
no subset of N vectors can be linearly dependent.
Thus, restricting vectors to be in general position is a
looser condition than linear independence. There are
$2^M$ possible dichotomies of these vectors. The fraction
of these that are linearly separable is

$$f(M,N) = \begin{cases} 1, & M \le N \\[4pt] 2^{\,1-M} \displaystyle\sum_{i=0}^{N-1} \binom{M-1}{i}, & M > N. \end{cases} \tag{7}$$


When the keys cannot be assumed to be in general
position, Eq. (7) is an upper bound. We extend Eq. (7)
to an associative memory formulation (with K-element
recollection vectors) by finding the fraction of groups
of K dichotomies that are all linearly separable. This
gives the fraction of all possible Y matrices that can be
correctly recalled in an H-K AP. Our fraction is Eq.
(7) raised to the Kth power:

$$g(M,N,K) = \begin{cases} 1, & M \le N \\[4pt] \left[ 2^{\,1-M} \displaystyle\sum_{i=0}^{N-1} \binom{M-1}{i} \right]^{K}, & M > N. \end{cases} \tag{8}$$

The asymptotic limit (for K = N) as N → ∞ is (modified
from Ref. 28)

$$\lim_{N \to \infty} g(M,N,K) = \begin{cases} 1, & M/N < 2 \\ 0, & M/N > 2. \end{cases} \tag{9}$$

Thus, the maximum storage for an H-K AP in a general
memory application (random keys and recollections) is

$$M = 2N. \tag{10}$$

This asymptotic limit is not achievable with finite
length (N) key vectors. If N is increased, M can be
increased accordingly (at the cost of increased memory
size). The value of N can be increased with higher
order APs or by forming random combinations of
the original key vector elements. We do not consider
these approaches, but our work in increasing the
AP capacity as a function of N applies to the transformed
key vectors produced by these methods.

Figure 2(a) plots Eq. (7) vs M/N. It shows the
probability (the fractional amount) that one row of Y
specifies a linearly separable grouping of key vectors.
Figure 2(b) plots Eq. (8) vs M/N for K = N. It shows
the probability that all K rows of Y specify linearly
separable groupings. As seen, the maximum storage
capacity of M = 2N cannot be achieved except with
infinite N, but Eq. (8) allows us to estimate the storage
capacity for finite N. For example, for N = 125, we see
from Fig. 2(b) that the probability that all rows of Y
designate linearly separable groupings is essentially 1
for a large storage capacity M < 1.5N. Even if a row of
Y does not specify a linearly separable grouping, it is
still possible for the corresponding output element to
be correct much of the time.

Fig. 2. Fraction of (a) all dichotomies of M N-dimensional vectors
that are linearly separable, and (b) all groups of K dichotomies of M
N-dimensional vectors that are linearly separable.

B.  Ho-Kashyap  and  Pseudoinverse  General  Memory 
Simulations 

We now test random H-K APs for agreement with
the above theory and for comparison to pseudoinverse
APs. We use N = 50 element key vectors, K = N = 50
element recollection vectors and vary M/N. (This
general memory uses equal key and recollection vector
dimensions. If the AP outputs were labels used to
read out high dimensional recollection vectors from a
larger second stage standard addressable memory, the
AP outputs would be of dimension K < N and the
memory matrix would be smaller.) When M/N > 1,
the key vectors are automatically linearly dependent.
All key vectors were randomly chosen and uniformly
distributed over −1 to +1. Each bipolar binary recollection
vector element was chosen randomly to be −1
or +1. For each M/N value, ten X and ten Y matrices
were generated. Our results are averaged over the ten
resulting memory matrices for each M/N value tested.
The H-K synthesis algorithm used ρ = 0.5 and was
limited to a maximum of 1000 iterations. The algorithm
was also stopped when the memory perfectly
recalled the exact key inputs or when E' = 0. To test
each AP, we used the M key vectors with four levels of
additive zero-mean Gaussian noise: σ = 0.0, 0.05, 0.1, 0.2.
The three nonzero noise levels correspond to signal-to-noise
ratios of 21, 15, and 9 dB, respectively.
Since the key vector elements were bounded by −1 and
+1, we bounded the noisy inputs to be within the same
limits. This limiting only slightly improved the recall
results (less than 1% improvement). The recall accuracy
(percentage of correct output elements) was computed
for each noise level. Figure 3 shows the results
for the standard pseudoinverse AP [Fig. 3(a)] and the
results for our basic H-K AP [Fig. 3(b)], for M/N ratios
of 0.2, 0.4, 0.6 to 1.6 in 0.04 increments, 1.8 and 2.0.

Fig. 3. Recall accuracy vs M/N for exact and noisy key vector inputs
using (a) pseudoinverse associative memory and (b) basic Ho-Kashyap
associative memory.
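
The quoted signal-to-noise ratios can be checked with a short calculation. The sketch assumes that the signal power is the variance of a key element uniformly distributed over −1 to +1, which is 1/3, and that the SNR is the ratio of this variance to the noise variance σ²; the text does not state the definition used, so this is an interpretation.

```python
import math

signal_power = 1.0 / 3.0          # variance of a uniform variable on [-1, 1]
sigmas = (0.05, 0.1, 0.2)

# SNR in dB for each noise standard deviation.
snr_db = [10.0 * math.log10(signal_power / s ** 2) for s in sigmas]
for s, snr in zip(sigmas, snr_db):
    print(f"sigma = {s:4.2f}  SNR = {snr:5.2f} dB")
```

Under this assumption the three values round to 21, 15, and 9 dB, in agreement with the figures given in the text.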
These results show improved performance and storage
capacity for the H-K vs the pseudoinverse AP. For
discussion purposes, we consider an AP to be useful if
its recall accuracy for exact key inputs exceeds 0.999.
The recall accuracy of the pseudoinverse AP exceeds
0.999 for exact key inputs only up to M ≈ 1.04N, and
degrades for M > 1.04N. The H-K AP exceeds 0.999
recall for M ≤ 1.52N for inputs with no noise. This is a
45% improvement in the capacity of the H-K over the
pseudoinverse AP. Thus, the H-K AP performs significantly
better than the pseudoinverse AP. Its improvement
over the correlation HAP (M ≈ 0.15N) is a
factor of 10 or 900% better performance. We note that
in all cases, the performance in noise degrades when M
≈ N (as expected for APs based on the pseudoinverse).
For noisy inputs when M > N, the H-K AP performs
better than the pseudoinverse AP at low noise levels
(σ = 0.05 for all M/N and σ = 0.1 for M ≤ 1.48N).
Although the pseudoinverse AP performs better at higher
noise levels (σ = 0.1 for M ≥ 1.52N and σ = 0.2 for M
> N), the recall accuracy is low (<90%) and, thus, this
difference is not of concern (since neither AP performs
very well for these noise levels). For the specific case
of M = 1.52N, the H-K AP recall was 0.05 higher than
the pseudoinverse AP recall for σ = 0.05; the two recall
accuracies were nearly identical for σ = 0.1; and both
recall accuracies were low (<90%) for σ = 0.2 when M >
N. Thus, neither AP may be suitable for inputs with a
large amount of noise (σ = 0.2) at high storage capacities
(M > N). But, for reasonable noise and performance,
the H-K AP is preferable and can be used when
M > N.

We now consider the use of our robust H-K AP to
improve noise performance when M ≈ N. For comparison,
we apply the robust algorithm with small singular
values removed to both the pseudoinverse AP and
the H-K AP. We set σ = 0.1 for the singular value
threshold expressed by Eq. (4). The exact value of
this threshold is not critical, since our choice for the
threshold also gives good performance for σ = 0.05 and
σ = 0.20. The results are shown in Figs. 4(a) and 4(b)
respectively, with M/N varied from 0.6 to 1.6 in 0.04
increments. As seen, both APs avoid the severe drop
in performance (when M ≈ N) in Fig. 3 for noisy inputs.
For the robust APs with exact key inputs, we obtained
(with $P_c$ denoting the recall accuracy):

$P_c$ ≥ 99.9% for M ≤ 0.84N for the robust pseudoinverse AP,
$P_c$ ≥ 99.9% for M ≤ 1.52N for the robust H-K AP,
80% storage improvement with robust H-K AP over
robust pseudoinverse AP.

Fig. 4. Recall accuracy vs M/N for exact and noisy key vector inputs
using (a) robust pseudoinverse associative memory and (b) robust
Ho-Kashyap associative memory.

The robust pseudoinverse AP performs worse than
the standard pseudoinverse AP with no noise, but
gives much better recall when noise is present and M ≈
N. The robust H-K AP performs best for exact key
inputs and inputs with low noise. The differences in
the robust H-K and pseudoinverse AP performance for
noisy inputs when M > N are the same as described
above for the standard H-K and pseudoinverse APs.
This is expected, since omitting small singular values
only affects results when M ≈ N. Hence, the recall
accuracy curves away from M = N are the same for the
robust and standard algorithms. Figure 5 plots the
fraction of M recollection vectors that are completely
correct for the robust pseudoinverse [Fig. 5(a)] and the
robust H-K [Fig. 5(b)] APs vs M/N for different
amounts of noise. Again, the robust H-K AP is preferable
when the recall accuracy is good (above 90%).

Fig. 5. Fraction of completely correct recollection vectors vs M/N
using (a) robust pseudoinverse associative memory and (b) robust
Ho-Kashyap associative memory.

In Table II, we show the rank of $\tilde{\mathbf{X}}$ (the key matrix
with small singular values set to zero) for N = 50 and for
various values of M. The entry min{M,N} is the minimum
of M and N and indicates what the rank of $\tilde{\mathbf{X}}$
would be if $\tilde{\mathbf{X}}$ were of full rank. The original X is full
rank for all M/N. By comparing the min{M,N} and
rank entries, we see when small singular values are omitted.
For M/N = 0.8, we omit an average of 0.7 singular
values and for M/N = 1.0, we omit an average of 5.7. For
M/N ≥ 1.52, no singular values are set to zero and $\tilde{\mathbf{X}} = \mathbf{X}$ is
of full rank. Thus, the robust H-K AP differs from the
standard H-K AP for 0.16N < M < 1.48N. In all cases
in this region where the standard H-K AP perfectly
recalled all exact key inputs, the robust H-K AP also
gave perfect recall accuracy. This experimental evidence
confirms the argument of Sec. III.D that the
robust H-K algorithm is highly likely to find a linearly
separable solution if one exists.

Table III shows the number of robust H-K iterations
used for different M/N ratios (with N = 50). For M/N
< 0.8, we see that the pseudoinverse is exact (no singular
values are set to zero) since the H-K algorithm does
not iterate. For M/N ≥ 0.8, the robust pseudoinverse
sets some singular values to zero and the H-K algorithm
is used to restore the recall accuracy for noiseless
key vectors. The number of H-K iterations required
increases with M/N because the keys become more
difficult to linearly separate.

We note good agreement between theory and tests.
For both the standard and robust H-K APs, all APs
tested for M/N < 1.28 and M/N = 1.44 were linearly
separable. Of the ten APs tested for each other M/N
value, at least one of each set was not linearly separable.
With N = K = 50, Eq. (8) predicts that the
transition from ten linearly separable memories to at
least one linearly nonseparable memory will occur between
roughly M/N = 1.34 and M/N = 1.56, with the
probability that all ten memories are linearly separable
being 0.99 at M/N = 1.34 and 0.05 at M/N = 1.56.
The experimental transition occurs at the lower end of
the theoretical transition. This is to be expected since
the theoretical transition is an upper bound due to its
general position assumption. The two transitions still
agree reasonably well. Thus, Eq. (8) allows us to estimate
capacity of the H-K AP for finite N.
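
The two transition probabilities quoted above can be reproduced from the separable-fraction formula. The sketch assumes that Eq. (7) has the classic general-position form $2^{1-M}\sum_{i=0}^{N-1}\binom{M-1}{i}$ for M > N, that Eq. (8) raises it to the Kth power, and that the ten memories are independent so that their joint probability is the tenth power; the function names are illustrative.

```python
from math import comb

def f(M, N):
    """Fraction of dichotomies of M general-position N-vectors that are
    linearly separable."""
    if M <= N:
        return 1.0
    return 2 * sum(comb(M - 1, i) for i in range(N)) / 2 ** M

def prob_all_separable(M, N, K, n_memories=10):
    """Probability that every row of every one of n_memories random
    memories is a linearly separable dichotomy."""
    return f(M, N) ** (K * n_memories)

N = K = 50
p_low = prob_all_separable(67, N, K)     # M/N = 1.34
p_high = prob_all_separable(78, N, K)    # M/N = 1.56
print(round(p_low, 2), round(p_high, 2))
```

The two printed values can be compared directly with the 0.99 and 0.05 stated in the text.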

VI.  Pattern  Recognition  Ho-Kashyap  Associative 
Processors 

This section presents a comparison of the pseudoinverse
and Ho-Kashyap APs in a two-class distorted
aircraft pattern recognition problem. We consider
two classes (Phantom and DC-10) of 128 × 128 pixel
aircraft imagery. Nominal views of these aircraft are
shown in Figs. 6(a) and 6(b). As our key vector
representation space, we use thirty-two wedge samples
of the Fourier transform (in half of the transform
plane). The wedge feature space provides scale invariance
(when the wedge samples are normalized) and
shift invariance, and is easily generated optically.
In-plane image rotations cause the wedge samples to
circularly shift. We consider the case when the aircraft
is moving and that tracking information provides
the location of the aircraft's nose. This information
allows the wedge samples from an unidentified aircraft
to be circularly shifted so that they align properly with
the training vectors. Thus, for moving aircraft this
feature space is rotation (in-plane), scale and shift
invariant, and we do not test these invariances.
Rather, we consider aircraft with out-of-plane distortions.
Figures 7(a) and 7(b) show the feature vectors
corresponding to the images in Fig. 6. The positions of
the peaks correspond to the angles of the edges of the
object and the peak values correspond to the lengths of
the edges. This interpretation gives the wedge feature
space an intuitive appeal. We augment each of the key
vectors with a 1. This increases the dimension of the
key vector hyperspace from N to N + 1 and does not
require the separating hyperplanes to pass through the
origin, as would occur otherwise. This technique is
well known and in APs is equivalent to varying the
thresholds on the output elements. The recollection
vectors are of dimension K = 1 with values of +1 for the
Phantom and −1 for the DC-10.

Fig. 6. Images used in the aircraft recognition problem: (a)
Phantom and (b) DC-10 aircraft.

Fig. 7. Wedge samples for the (a) Phantom and (b)
DC-10 images in Fig. 6.
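
A digital stand-in for the wedge-sampled feature vector might look like the sketch below. It is an illustration under assumptions made here, not the authors' optical system: the power spectrum is computed with an FFT, angles are folded into half of the transform plane, energy is summed in thirty-two equal angular wedges and normalized to unit sum, and a constant 1 is appended. The function name `wedge_features` and the random test image are hypothetical.

```python
import numpy as np

def wedge_features(img, n_wedges=32):
    """Normalized energy in n_wedges angular wedges of the power spectrum."""
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    v, u = np.indices((h, w))
    theta = np.arctan2(v - h // 2, u - w // 2)     # angle about the DC term
    theta = np.mod(theta, np.pi)                   # fold into the half plane
    idx = np.minimum((theta / np.pi * n_wedges).astype(int), n_wedges - 1)
    feats = np.bincount(idx.ravel(), weights=power.ravel(),
                        minlength=n_wedges)
    return feats / feats.sum()                     # normalize the samples

rng = np.random.default_rng(3)
img = rng.random((32, 32))                         # stand-in for an aircraft image
feat = wedge_features(img)
feat_shifted = wedge_features(np.roll(img, (3, 5), axis=(0, 1)))

key = np.append(feat, 1.0)                         # augmented key, N = 33
print(key.shape, np.allclose(feat, feat_shifted))
```

A circular shift of the image changes only the phase of its Fourier transform, so the wedge energies are unchanged, which is the shift invariance referred to above.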

Two training sets were used. The first consists of
882 key vectors for the two aircraft rotated in pitch and
roll between ±50° at 5° increments. The second consists
of 2178 key vectors for the two aircraft rotated in
pitch and roll between ±80° at 5° increments. Thus,
we consider APs with

N = 33-dimensional key vectors,
M = 882 and 2178 key/recollection pairs,
K = 1-dimensional recollection vectors.

Both cases represent linearly dependent key vectors,
with large capacities M ≈ 25N and M ≈ 65N respectively.
An H-K AP was produced using the same
stopping criteria as in Sec. V. (We did not test the
robust H-K algorithm since the test vector distortions
are not easily quantified into an equivalent noise σ.)
The first case (M = 882) was found to be a linearly
separable problem, since the H-K algorithm gave 100%
correct classification after thirty-two iterations. As
shown in Table IV, the H-K AP yields perfect performance
on the training set, whereas the pseudoinverse AP
does not. For test data in Table IV, we used 800
aircraft (not present in the training set) with pitch and
roll varied between ±47.5° at 5° increments (i.e., at
least 2.5° different in pitch and roll from the training
data). The H-K AP also gives perfect performance for
these inputs and better performance than the pseudoinverse
AP. The second case (M = 2178) represents
a linearly nonseparable problem, as shown in Table V.
The H-K algorithm was stopped at eighty iterations
since E' = 0 then. However, E ≠ 0, and thus the
algorithm indicated that the keys were not linearly
separable. The test data in Table V used 2048 aircraft
with pitch and roll varied between ±77.5° at 5° increments.
We see that the H-K AP gives excellent performance
in both cases.
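
The training and test set sizes follow from the pitch and roll grids. The short check below assumes that every pitch value is paired with every roll value for each of the two aircraft; the helper name `n_views` is illustrative.

```python
def n_views(max_angle, step=5.0, n_aircraft=2):
    """Number of images when pitch and roll each run from -max_angle to
    +max_angle in increments of step, for n_aircraft aircraft."""
    per_axis = int(round(2 * max_angle / step)) + 1
    return n_aircraft * per_axis ** 2

train_50, train_80 = n_views(50.0), n_views(80.0)
test_47, test_77 = n_views(47.5), n_views(77.5)
print(train_50, train_80, test_47, test_77)
```

The counts agree with the 882 and 2178 training vectors and the 800 and 2048 test vectors quoted in the text.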

Table IV. Misclassification Results for Pseudoinverse and Ho-Kashyap
Memories with ±50° Training Set

                     % Misclassified
Training Set:
    Pseudoinverse    0.68
    H-K AP           0.00
Test Set:
    Pseudoinverse    0.25
    H-K AP           0.00

Table V. Misclassification Results for Pseudoinverse and Ho-Kashyap
Memories with ±80° Training Set

                     % Misclassified
Training Set:
    H-K AP           7.0
Test Set:
    H-K AP           6.1

VII. Summary and Conclusion

We have shown that the Ho-Kashyap associative
processor has a larger storage capacity than the pseudoinverse
processor and that it can store linearly dependent
key vectors more accurately than the pseudoinverse
processor. We have detailed a new robust
Ho-Kashyap processor to improve the noise performance
of the H-K AP. This new processor allows operation
on linearly dependent key vectors, achieves much
better storage (M ≈ 2N for general memory applications),
and significantly improves noise performance
when M ≈ N. (The last advantage is due to incorporating
Murakami and Aibara's technique.) For N =
50 element key vectors, we showed: 100% recall accuracy
for our Ho-Kashyap general memories for M <
1.5N and 99.7% accuracy with M = 1.6N; nearly 900%
larger storage capacity than a correlation AP; 40%
larger storage capacity than the pseudoinverse memory;
and 90% improved noise performance when M ≈ N.
Our pattern recognition case study showed 3-D distortion
invariance and excellent (>93%) recall accuracy
for large M = 25N and M = 65N cases with linearly
dependent key vectors. We have discussed an optical
architecture for implementing H-K recall. The error-correcting
AP algorithm that we propose appears attractive
for optical AP synthesis because of its expected
low dynamic range requirements.

Appendix  A:  Convergence  Proof  for  Robust 
Ho-Kashyap  Algorithm 

We prove that the robust H-K algorithm converges when $0 < \rho < 1$. Without loss of
generality, we consider the case where K = 1 (i.e., Y and M are row vectors). To
simplify notation, let m be an N × 1 column vector that equals $M^T$, and b be an
M × 1 column vector that equals $Y^T$, and let $Z = X^T$ and $\tilde{Z} = \tilde{X}^T$.
We also multiply all key vectors belonging to the second class by −1. This makes the
desired outputs b all positive. (Initially, b is all +1 and during the iterations the
output elements change but remain positive.) The robust H-K algorithm is now given by


Step   Operation

1      $m_n = \tilde{Z}^+ b_n$                    (A1)

2      $e_n = Z m_n - b_n$                        (A2)

3      $e'_n = \tfrac{1}{2}(e_n + |e_n|)$         (A3)

4      $b_{n+1} = b_n + 2\rho\, e'_n$             (A4)

5      If $e'_n \neq 0$, go to 1.                 (A5)


Step 3 sets all negative e elements to 0. The modified key matrix $\tilde{Z}$ is used
in the pseudoinverse in Eq. (A1) to improve noise performance when M ≈ N. The
unmodified key matrix Z is used to compute the error in Eq. (A2) because we want all
the noiseless keys to be correctly recalled.

The  proof  follows  the  same  steps  as  in  Ref.  22  for  the 
original  H-K  algorithm.  However,  the  proof  in  Ref.  22 
requires  that  M  >  N  and  that  X  be  full  rank.  These 
are  valid  assumptions  for  overdetermined  PR  applica¬ 
tions,  but  not  for  the  general  memory  application. 
Our  proof  for  the  robust  algorithm  makes  no  such 
assumptions. 

The proof makes use of the facts that $Z\tilde{Z}^+$ is symmetric, positive
semidefinite and idempotent (i.e., the square of the matrix equals itself). We show
these properties using the SVD of Z, which is given by $Z = U\Sigma V^T$, where U and
V are M × R and N × R matrices (R is the rank of Z) with orthonormal columns, and
$\Sigma$ is an R × R diagonal matrix containing the R singular values of Z. Note that
$U^T U = V^T V = I$. The pseudoinverse is given by $Z^+ = V\Sigma^+ U^T$, where
$\Sigma^+$ is a diagonal matrix containing the reciprocals of the singular values
(except for singular values equal to zero, which remain zero). The modified
$\tilde{Z}$ is given by $\tilde{Z} = U\tilde{\Sigma}V^T$, where $\tilde{\Sigma}$ is
identical to $\Sigma$ except that the small singular values have been set to 0. We
denote the rank of $\tilde{\Sigma}$ as $\tilde{R}$. We see that
$Z\tilde{Z}^+ = U\Sigma V^T V\tilde{\Sigma}^+ U^T = U\tilde{I}U^T$, where $\tilde{I}$
is a diagonal matrix with the first $\tilde{R}$ diagonal elements equal to 1 and the
remaining elements equal to 0. Clearly, $U\tilde{I}U^T$, and hence $Z\tilde{Z}^+$, are
symmetric and positive semidefinite. We also see that
$(Z\tilde{Z}^+)(Z\tilde{Z}^+) = U\tilde{I}U^T U\tilde{I}U^T = U\tilde{I}U^T = Z\tilde{Z}^+$,
so $Z\tilde{Z}^+$ is idempotent. We now use these matrix properties.

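To show that the algorithm converges, we show that $\|e_n\|^2 - \|e_{n+1}\|^2$ is
positive. Substituting Eq. (A1) into Eq. (A2) gives

$$e_n = (Z\tilde{Z}^+ - I)\,b_n. \qquad (A6)$$

Substituting Eq. (A4) into Eq. (A6) gives

$$e_{n+1} = e_n + 2\rho\,(Z\tilde{Z}^+ - I)\,e'_n. \qquad (A7)$$

Taking 1/4 of the squared norm of each side of Eq. (A7) yields

$$\tfrac{1}{4}\|e_{n+1}\|^2 = \tfrac{1}{4}\|e_n\|^2 + \rho\, e_n^T (Z\tilde{Z}^+ - I)\, e'_n + \rho^2 \|(Z\tilde{Z}^+ - I)\, e'_n\|^2, \qquad (A8)$$

which can be rearranged as

$$\tfrac{1}{4}\left(\|e_n\|^2 - \|e_{n+1}\|^2\right) = \rho\, e_n^T e'_n - \rho\, e_n^T Z\tilde{Z}^+ e'_n - \rho^2 \|(Z\tilde{Z}^+ - I)\, e'_n\|^2. \qquad (A9)$$

The first term on the right hand side of Eq. (A9) equals $\rho\|e'_n\|^2$ because the
negative $e_n$ elements are multiplied by the $e'_n$ elements that are 0. The second
term reduces to 0. To show this, substitute Eq. (A6) for $e_n$ into this term to obtain

$$-\rho\, e_n^T Z\tilde{Z}^+ e'_n = -\rho\, b_n^T (Z\tilde{Z}^+ - I) Z\tilde{Z}^+ e'_n = -\rho\, b_n^T (Z\tilde{Z}^+ - Z\tilde{Z}^+)\, e'_n = 0, \qquad (A10)$$

where the second equality uses the fact that $Z\tilde{Z}^+$ is idempotent. The third
term can be expanded as

$$-\rho^2\, e'^T_n \left[(Z\tilde{Z}^+)^T (Z\tilde{Z}^+) - (Z\tilde{Z}^+)^T - Z\tilde{Z}^+ + I\right] e'_n. \qquad (A11)$$

Since $Z\tilde{Z}^+$ is symmetric and idempotent, the term simplifies to

$$-\rho^2\, e'^T_n (I - Z\tilde{Z}^+)\, e'_n. \qquad (A12)$$

With these simplifications, Eq. (A9) can now be rewritten as

$$\tfrac{1}{4}\left(\|e_n\|^2 - \|e_{n+1}\|^2\right) = \rho(1 - \rho)\|e'_n\|^2 + \rho^2\, e'^T_n Z\tilde{Z}^+ e'_n. \qquad (A13)$$

The quantity $\|e'_n\|^2$ is strictly positive (it can be zero only when the algorithm
has terminated) and the second term on the right hand side is nonnegative since
$Z\tilde{Z}^+$ is positive semidefinite. Therefore, the algorithm is guaranteed to
converge if $\rho(1 - \rho) > 0$, or for $0 < \rho < 1$.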
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ABSTRACT 

We  consider  the  classification  of  multiple  objects  in  a  scene  with  distortion  and  clutter 
present.  Our  opinions  on  the  role  for  neural  nets  (NNs)  in  this  application  and  the 
different  properties  that  NNs  must  have  to  address  this  problem  are  advanced.  A 
hierarchical/inference  approach  is  suggested  using  correlation  NNs  for  low-level 
operations  and  new  classifier  NNs  with  higher-order  decision  surfaces  for  the  final 
decision NNs. Our concern is NN capacity and performance (in noise). Our capacity
guidelines advanced concern the number of neurons, use of analog neurons, Ho-Kashyap
(HK) NNs, and two new NNs with higher-order decision surfaces. Our noise performance
guidelines advanced concern the number of neuron layers, hidden-layer neuron encoding,
and robust HK NNs.

1.  INTRODUCTION 

For the demanding problem considered, we feel that even NN solutions should use a
hierarchical/inference approach. The levels in such a system [1] are shown in Figure 1
and discussed briefly in Figure 2. Subsequent sections address the role for NNs in each
level and the different properties required in each level (hence our use of a
hierarchical approach, as is used in ATR [2]). Extensive use is made of prior work
since much of it does not seem to be generally appreciated, possibly due to the vast
quantity of NN literature. Carnegie Mellon work is emphasized, since we are most
familiar with it and since it has addressed the problem we consider.

2.  CORRELATION  NNs 

These are used for the detection, enhancement and feature extraction levels in Figure
1. The detection NN is the lowest-level processor. It operates on the entire scene and
its function is to locate candidate regions of interest (ROIs). Since this level
requires handling object distortions, multiple objects, and clutter, we do not attempt
discrimination (identification) initially. With multiple objects present, a parallel
solution requires shift-invariance (SI). With clutter present, a parallel solution
requires a large spatial set of weights (space bandwidth product) and hence a
correlator (for processing gain). Figure 3 shows the standard 2-layer NN with shift
invariant Fourier transform (FT) interconnections. The weights at P2 are applied to
every P1 neuron in parallel and the P3 outputs are the weighted sum of the product of
the weights and each input region with a P3 nonlinearity (threshold etc.) applied.
This is a NN version of the standard correlator (Figure 4) and hence we refer to it as
a correlation NN. Alternative NNs use N1^ interconnections (there are N1 input P1
neurons) to achieve [3] SI. With N1 = 10^ iconic neurons, this is excessive and free
space SI FT interconnections and FT weights clearly appear preferable to N1^
interconnections [4] when SI is required.

2.1  DETECTION  NN 

We  use  hit-miss  (H-M)  weights  (filters)  or  rank-order  filter  techniques  applied  to  the 
input  scene  and  its  complement,  threshold  the  two  P3  outputs,  and  intersect  the  H  and  M 
P3  outputs  to  achieve  detection.  Figure  5  shows  an  example  of  the  detection  of  7  input 
ROIs with this technique. It handles hot, cold and bimodal objects as seen. The initial
algorithm has been detailed [5] and demonstrated [6], and its advanced variations [7]
performed very successfully in a wide range of strong background clutter.
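
The hit-miss detection step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm of [5-7]: the scene is binary, the correlation is computed directly rather than optically, both thresholds require every template pixel to match, and the names `correlate`, `hit_miss_detect`, `scene`, `hit` and `miss` are hypothetical.

```python
def correlate(scene, template):
    """Valid-mode 2-D correlation of a binary scene with a binary template."""
    sh, sw = len(scene), len(scene[0])
    th, tw = len(template), len(template[0])
    out = []
    for r in range(sh - th + 1):
        row = []
        for c in range(sw - tw + 1):
            row.append(sum(scene[r + i][c + j] * template[i][j]
                           for i in range(th) for j in range(tw)))
        out.append(row)
    return out


def hit_miss_detect(scene, hit, miss):
    """Top-left (row, col) positions where the hit template fits the object
    and the miss template fits the background (intersection of both planes)."""
    complement = [[1 - p for p in row] for row in scene]
    hit_plane = correlate(scene, hit)          # output plane for the scene
    miss_plane = correlate(complement, miss)   # output plane for its complement
    hit_thr = sum(map(sum, hit))               # assumed: all hit pixels must match
    miss_thr = sum(map(sum, miss))             # assumed: all miss pixels must match
    return [(r, c)
            for r, row in enumerate(hit_plane)
            for c, v in enumerate(row)
            if v >= hit_thr and miss_plane[r][c] >= miss_thr]


# Hypothetical scene: a 2x2 object (top left) and a larger 3x3 object (right).
scene = [[0, 0, 0, 0, 0, 0, 0, 0],
         [0, 1, 1, 0, 0, 0, 0, 0],
         [0, 1, 1, 0, 1, 1, 1, 0],
         [0, 0, 0, 0, 1, 1, 1, 0],
         [0, 0, 0, 0, 1, 1, 1, 0],
         [0, 0, 0, 0, 0, 0, 0, 0]]
hit = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
miss = [[1, 1, 1, 1], [1, 0, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1]]
detections = hit_miss_detect(scene, hit, miss)
```

With these templates only the 2x2 object is detected; the 3x3 object passes the hit test at several shifts but fails the miss test, which is the role of the intersection.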

2.2  ENHANCEMENT  NN 

Prior  to  attempting  classification  of  the  object  in  each  ROI,  it  is  generally  advisable  to 
enhance  each  ROI.  This  involves  noise  removal,  filling  in  holes  on  the  object,  edge 
enhancement, etc. Since the location of the object in each ROI is not known,
enhancement requires SI and thus we use our correlation NN (Figure 3). Now N1 is
smaller (only the ROI pixels are input to P1). The P2 weights used are now
morphologically inspired and are simple uniform structuring element (SE) filters
(disks etc.). The spatial size of each SE weight function determines the size of the
holes filled in on the object and the size of noise regions omitted. The P3 neuron
thresholds used define the operation performed: a low threshold yields a dilation and
a high threshold yields an erosion. The difference between dilation and erosion images
yields an edge enhanced image. Figure 6 shows examples of these operations. Some
similar operations can also be achieved using NN retina chips [8] etc. Their
correlation NN realization and the many operations possible (besides those shown in
Figure 6) are detailed elsewhere [9].
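
The relation between threshold and morphological operation can be illustrated with a small sketch. It is a simplification under stated assumptions: a binary 1-D signal stands in for the ROI, the SE is a uniform window of odd length, borders are zero padded, and all function names are hypothetical.

```python
def se_correlate(signal, se_len):
    """Correlate a binary 1-D signal with a uniform SE of odd length."""
    half = se_len // 2
    padded = [0] * half + list(signal) + [0] * half
    return [sum(padded[i:i + se_len]) for i in range(len(signal))]


def threshold(values, thr):
    return [1 if v >= thr else 0 for v in values]


def dilate(signal, se_len):
    # Low threshold: any overlap with the SE turns the output on.
    return threshold(se_correlate(signal, se_len), 1)


def erode(signal, se_len):
    # High threshold: the SE must be completely covered.
    return threshold(se_correlate(signal, se_len), se_len)


holed = [0, 0, 1, 1, 0, 1, 1, 0, 0]      # object with a one-pixel hole
noisy = [0, 1, 1, 1, 1, 0, 0, 1, 0]      # object plus one noise pixel
clean = [0, 0, 1, 1, 1, 1, 0, 0]

filled = erode(dilate(holed, 3), 3)      # dilation then erosion fills the hole
denoised = dilate(erode(noisy, 3), 3)    # erosion then dilation removes noise
edges = [d - e for d, e in zip(dilate(clean, 3), erode(clean, 3))]
```

The same correlation output is used in both cases; only the threshold changes, which matches the description of a single architecture with different neuron thresholds.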

2.3 FEATURE EXTRACTION NN

Prior  to  classifying  an  object,  features  are  generally  extracted  to  describe  each  ROI. 
Since  the  location  (or  even  the  presence)  of  an  object  in  each  ROI  is  not  known,  SI  is  again 
required.  We  thus  prefer  to  again  use  the  correlation  NN  (Figure  3)  for  feature  extraction. 
In  this  case,  the  NN  weights  at  P2  are  chosen  using  computer  generated  hologram  (CGH) 
techniques,  which  can  achieve  a  larger  number  of  different  feature  spaces  at  the  P3 
neuron outputs [10]. Many NNs have been described that can calculate features such as
edges, moments, Hough transforms, etc. However, we see no reason to use such NNs
versus  standard  correlator  or  VLSI  chips  for  these  purposes.  NNs  can  also  conceptually 
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FIGURE  1:  Hierarchical  inference  levels  of  scene  analysis. 


1)  DETECTION 

MULTIPLE  OBJECTS,  SHIFT-INVARIANT  WEIGHTS 
LOCATE  REGIONS  OF  INTEREST  (ROIs) 

2)  IMAGE  PROCESSING  (ENHANCEMENT) 

REDUCE  NOISE,  FILL  IN  HOLES,  EDGE  DETECTION 

3)  FEATURE  EXTRACTION 

4) IDENTIFICATION

DETERMINE  CLASS  OF  OBJECT  IN  EACH  ROI 

HIGHER  ORDER  MORE  COMPLEX  NN  DECISION  SURFACES 


FIGURE  2:  Remarks  on  levels  in  Figure  1. 
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FIGURE  3:  Shift  invariant  correlation  NN. 


FIGURE  4:  Standard  correlator  (optical). 
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(a)  Input  (b)  Output 

FIGURE  5:  Detection  NN  example  results. 


FIGURE 6: Image enhancement NN example. (a) Input; (b) erosion followed by dilation
(noise removal); (c) dilation followed by erosion (fill in holes); (d) edge enhancement.

be  used  to  compute  a  set  of  best  features  (with  no  a  priori  choice  of  the  feature  space).  No 
ideal  NN  for  this  has  yet  emerged.  Candidate  solutions  such  as  the  Neocognitron  [11] 
use  an  excessive  number  of  NN  layers  and  other  solutions  such  as  the  ART  [12]  require 
complex  individual  neuron  elements  (with  a  parallel  array  of  such  elements  for  each 
input  pixel). 


2.4  UNIFIED  MULTIFUNCTIONAL  NN 

An attractive aspect of our NN approach to the first 3 processing levels noted in
Figure 1 is that the same correlation NN architecture is used with different P1 input
neurons (the full scene or an ROI) and different weights at P2 (H-M, SE, CGH choices).
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3.  NN  CAPACITY 


The classifier NN we consider is the standard 3-layer NN in Figure 7 (i.e. with one
hidden layer of neurons at P2). The notation we use considers N1 input neurons
(features) at P1, N2 hidden layer neurons at P2 and N3 output neurons at P3 (N3 is the
number of object classes, although P3 encoding can represent more than N3 classes). We
denote the number of training set images (vectors) by NT.


FIGURE 7: Basic 3-layer classification NN considered (P2 neurons: clusters, several
per class; P3 neurons: class).

3.1  NUMBER  OF  LAYERS 

A 3-layer NN can produce any decision surface [13,14] and hence is used. The proof of
this requires more N2 neurons to better approximate higher-order surfaces. The new
higher-order NNs we consider (Section 4) allow higher-order surfaces and exact (not
approximate) higher-order surfaces with few N2 neurons. When noise is considered, one
can show [15] that performance degrades with more neuron layers since errors
propagate. We also consider only 3-layer NNs since we have not yet found a good
algorithm (without ad hoc parameters) for training NNs with more than one hidden
layer. If significant training is performed, 4-layer NNs may be able to train out
noise propagation effects. However, 3-layer NNs are clearly preferable if they perform
well (as ours do), as they are more well defined and less ad hoc. Work on mapping
decision trees into NNs [16] is not considered as it leads to many neural layers and
often binary neurons, it scales poorly with increased size and is contrary to parallel
and NN concepts. Decision trees also classically use one cluster per class (while we
find that several N2 neurons per class is preferable).

3.2 ANALOG NEURONS AND WEIGHTS

High capacity requires analog input P1 neurons and analog weights from P1-P2.
Polarization methods [17] to achieve optical bipolar data (in optical NNs) are
restricted to only binary neurons and weights and require hard clipped output neurons.
Thus, they are not of use. Since image pixels and features are analog, the analog P1
neuron requirements are essential.

3.3  NUMBER  OF  INPUT  NEURONS 

As N1 increases, so does capacity. However, numerical stability is now of concern [18]
(i.e. the defined problem is ill conditioned). This also translates into increased
required numeric accuracy in the P1 neurons and the P1-P2 weights. A larger N1 also
requires a larger training set (NT) and the curse of dimensionality [19] then enters.
With many N1 features, some features will be of little use (small variance for the
different classes) and other features will have large variances. Thus, a larger N1 is
good for capacity, but introduces considerable practical problems (noise, instability,
accuracy requirements, a large NT, many local minima). Our hierarchical approach
relieves these problems, but one should still restrict N1 (our use of feature space P1
neurons addresses this). Numerical accuracy requirements clearly increase with N1
(they also increase with N2, but the effect seems much less).

3.4 NUMBER OF HIDDEN LAYER NEURONS N2

3.4.1  Approaches 

No general solution yet exists to this. However, it seems well worthwhile to advance
remarks on various approaches to determining N2. A number of papers have established
a bound of N2 = NT − 1, but this assigns one neuron to nearly each of the NT training
images and N2 is too large to be practical. Other derivations make unrealistic
assumptions not valid for general data that contains distorted images, etc. Neural
nets that perform piecewise linear input-to-output mapping generally use a large
N2 ≈ NT − 1 and thus memorize and perform a desired functional mapping (this is not
the problem addressed in our classifier NNs). Many methods simply increase or decrease
N2 until adequate performance is obtained. We desire non ad hoc methods and a good
initial N2 choice (else training time is excessive, optimization is not necessarily
obtained and comparisons are not easily possible). Techniques [20] that select the
number of N2 neurons per class based upon the a priori probability of each class
occurring did not perform well (we attribute this to the fact that the number of N2
neurons per class should be based on how disjoint a class is and how similar two
classes are). Methods which use linear algebra [21] and covariance [22] techniques to
calculate subspaces etc. cannot be
applied to large N1 and NT cases [21,22] and often [21] apply only to binary neurons.
Pruning/removing weights does not [23] always yield reliable results and causes data
to become linearly inseparable; thus we ignore such methods and note that decreasing
N2 is not yet easy and adding additional layers [23] when reducing N2 causes other
problems.

3.4.2  General  N2  Remarks 

We now advance several obvious (but not quantitative) general remarks on N2. Small N2
will not allow the problem to be solved. Excessive N2 results in memorizing the
training data and generalization (good test data results) does not occur (and local
minima may arise). With N2 neurons, one has N2 regions in decision space and can form
a maximum of N2(N2 − 1)/2 decision boundaries (lines), one between each N2 pair. In
practice, the number of decision boundaries is much less as many are not of use (such
as ones between two N2 neurons in the same class). No clear relationship between N2
and the number of classes N3 can generally be obtained. Some insight into how N2
relates to decision surfaces [22] exists, but no decision emerges in general. Clearly
capacity
increases as N2 increases and N2 increases as N1 and N3 increase. N2 relates to capacity
and decision surfaces. Our new higher-order decision surface NNs (Section 4) address
this problem. They provide improved Pc with fewer N2 neurons and thus are preferable.
These methods are best seen by attention to linear discriminant functions and how NNs
produce decision surfaces. Section 4 addresses these issues.

3.4.3  Preferable  Approach  (Prototypes) 

The technique we use in the Adaptive Clustering NN (ACNN) [4] to select N2 uses
prototypes obtained from the training set. This concept has long been used [24,25] to
extend linear classifiers to piecewise linear ones. The learning vector quantizer [26]
(LVQ) uses the PDF of each class and the data to select N2. This results in a larger
N2 than in the ACNN. In our NN, we are concerned with decision surfaces, not with
modeling data PDFs (specifically, our concern is with the boundaries between data
clusters, not the means of such clusters). We feel it is important to not pick initial
prototypes at random or uniformly distributed (as LVQ does) as this yields larger N2.
We also feel that use of more than one prototype per class is needed, but these should
not be arbitrarily used. Rather, they should be used when class data is disjoint and
discrimination between two similar classes is needed. This is somewhat related to
K-nearest-neighbor (KNN) classifiers (but our motivation is different and no NN as yet
performs KNN; rather, many KNN designs are presented as NNs).

Our clustering technique [4] to select N2 allows a rapid analysis of the training set
followed by a second selection process to pick the best prototypes (more than one per
class). These are chosen both to represent the data and, primarily, to discriminate
classes (i.e. we consider the boundaries between clusters rather than the mean of each
cluster). This choice (rather than random prototypes) and the use of more than one per
class distinguishes our method from others. We use these prototypes to set initial
P1-P2 weights and then we adapt them. Our initial weights and adaptation of them
differ from other prototype methods. We include an additional P1 neuron with input 1
and weights related to the sum of the squares of the other weights (Section 4.1). As
the NN algorithm adapts, this constraint on the added N1 neuron is not enforced (in
LVQ it is still enforced). In LVQ, this forces decision boundaries to lie on the
perpendicular bisector (midpoint) between prototypes. In the ACNN, the decision
surface need not lie at the midpoint. Thus, the ACNN yields more flexibility and
better Pc results with fewer N2 neurons.

3.4.4  Prototype  Extensions 

Care must be taken to not select N2 too large. We generally achieve a correct
classification of only Pc ≈ 50% with the initial N2 prototypes. If a small N2 yields
high Pc, then a NN may not be needed. However, test results may be poor with this N2
choice. One can check test set performance every few hundred adaptations to verify
this and lower N2 and restart if needed. If N2 is too small, the NN adaptations
provide this information and we then increase N2. To remove dependence on the order in
which the training data is presented, we train in batch mode. We avoid outliers by our
N2 selection. If the input image training data contains many artifacts, it is
essential to not overtrain in NN adaptations. If noise is present in the training
data, then distributions can help avoid noise effects. Training on noisy data is now
needed. Use of PDFs can be of use here, but proper selection of N2 yields similar
results with a preferable algorithm. In general, one should not add additional N2
neurons to handle several training images that are a small percentage of a class or of
NT.

With a good initial N2 estimate (as above), we can easily increase N2. The new N2
neuron added would be one to handle the class or discrimination between classes
needing assistance. Our clustering provides data on the next best such prototype. Our
N2 selection method is preferable to N2 choices based on the PDF of each class, since
we consider discriminating classes, not modeling of a class (i.e. attention to class
boundaries, not class means), and we can far more easily choose a new N2 neuron and its
weights. The new prototype sets these weights and they are not random as others [27]
use; all other prior weights are kept unchanged. We know, from the NN results, the
classes needing help and from clustering we know the prototypes to add.

3.5 HO-KASHYAP (HK) NNs

For the best NN storage, we can and have [28,29] quantified the best algorithm to
select the P1-P2 weights. For inputs in general position, this is the HK NN algorithm.
It handles dependent inputs and yields storage of N3 = 2N1 classes. This is
theoretically the maximum (using WTA P2 neurons). We have shown [28] that the best
results with input noise occur for the robust HK algorithm. We have also shown that
this yields adequate analog input neuron accuracy and P1-P2 weights when limited
analog accuracy is present. This is relevant to optical and analog VLSI NNs. The noise
control parameter σsyn in the HK algorithm is selected to optimize such noise
conditions and accuracy limitations. We found that L-max P2 neuron encoding yields the
best Pc in noise etc. These results do not seem to be widely known. Our recent [30]
results show that with this algorithm we can also train out internal NN accuracy and
non-ideal device effects. We have consistently achieved better Pc results with this
algorithm than with an infinite accuracy NN quantized (after the fact) to lower input
and weight accuracy.

3.6  PERFORMANCE 

The performance we consider is the percent correct classification Pc of the input
data. To be meaningful, one must obtain Pc > 90-95% (50-70% performance is not of
general use) and this must be obtained with input noise and non-training test data
with N3 > N1 storage. It is also crucial to be able to modify the same NN to maximize
Pc and/or minimize Pe (probability of error). We will show how this is possible
(Section 4). To improve Pc and Pe performance, higher-order decision surfaces are
needed. Section 4 addresses new NNs to achieve this (we achieve exact, not
approximate, surfaces without the need for large N2). Another issue of concern in NNs
is training time. Our use of conjugate gradient methods [4] is much faster (by a
factor of 100 to 1000) than standard gradient descent or delta rule neuron update
algorithms. This result does not seem to be widely known.

4. NN DECISION SURFACES

4.1  PIECEWISE  LINEAR  DECISION  SURFACES 

We  produce  piecewise  linear  discriminant  surfaces  by  using  initial  P1-P2  weights  that 
are  linear  discriminant  functions  (LDFs)  set  by  N2  exemplars  and  using  NN  techniques 
to adapt and combine these. Figure 8 shows [4] an example for N1 = 2 input neurons and
N3 = 3 classes using N2 = 6 hidden layer neurons. As seen, the 3 classes of data are
separated and nonlinear surfaces are required to achieve this. We include an
additional input neuron whose input is 1 and whose weights are related to the sum of
the squares of the other weights. This allows the threshold for each P2 neuron to be
separately adapted. This allows each line in Figure 8 to be shifted (i.e. it need not
pass through the origin) and this (together with our non ad hoc N2 selection) avoids
local minima. This also ensures nearest neighbor convergence (the closest P2 prototype
neuron
is  the  most  active  one).  We  note  that  a  nonlinearity  at  neuron  layer  P2  is  essential 
otherwise  a  linear  NN  results,  which  is  merely  a  standard  pattern  recognition  LDF  and 
not  a  true  NN.  We  now  discuss  two  other  techniques  using  only  1  or  2  additional  input 
neurons  that  allow  higher-order  decision  surfaces.  These  methods  are  preferable  to 
others  that  use  an  excessive  number  of  interconnections  to  provide  higher-order 
weights. 
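
The nearest-neighbor property of the extra input neuron described above can be checked with a short sketch. The text only says the added weight is "related to" the sum of the squares of the other weights; the specific choice of −|p|²/2 below is an assumption, as are the function names. It follows from p·x − |p|²/2 = (|x|² − |x − p|²)/2, so the largest activation belongs to the closest prototype.

```python
def prototype_weights(prototype):
    """Initial weights for one hidden neuron: the prototype itself plus a
    weight of -|p|^2 / 2 (assumed form) on the extra input fixed at 1."""
    return list(prototype) + [-0.5 * sum(p * p for p in prototype)]


def activations(weight_rows, x):
    """Linear hidden-layer outputs for input x with the constant input appended."""
    xa = list(x) + [1.0]
    return [sum(w * v for w, v in zip(row, xa)) for row in weight_rows]


def most_active(values):
    return max(range(len(values)), key=values.__getitem__)


prototypes = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]   # hypothetical prototypes
weights = [prototype_weights(p) for p in prototypes]
acts = activations(weights, (3.0, 1.0))
winner = most_active(acts)
```

Once adaptation is allowed to change the added weight freely, the boundary between two neurons is no longer forced onto the perpendicular bisector of their prototypes.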


FIGURE 8: Piecewise nonlinear decision surfaces.


4.2 PIECEWISE HYPERSPHERICAL NN [31]

This concept is best shown for the case of $N_1' = 2$ input neurons with an input
described by $x' = [x_1\ x_2]^T$. We add 2 neurons and use $N_1 = 4$ neurons
$x = [1\ \ (x_1^2 + x_2^2)\ \ x_1\ \ x_2]^T$, and we denote the weights to hidden layer
neuron n by $w_n = [w_1\ w_2\ w_3\ w_4]^T$. The decision boundary for each P2 neuron is
$x^T w_n = 0$ or

$$w_1 + w_2(x_1^2 + x_2^2) + w_3 x_1 + w_4 x_2 = 0.$$

As seen, this describes a circle. It also describes a line (if $w_2 = 0$). If
$N_1' > 3$, it is a hypersphere. Each P2 neuron can now describe a hypersphere and the
decision surfaces are piecewise hyperspheres. To design this NN, we add several N2
neurons per class (as needed) to separate one class from the others. To use this
neural net, we simply look at the signs of the P2 neurons. If each neuron is > 0
(< 0), the sign denotes if the corresponding input is inside (outside) the sphere for
that N2 neuron. Note that each hypersphere is easily designed in the new N1 space as
an LDF but when used in the original space is a hypersphere.
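
As a minimal sketch of the augmented input described above, the following shows that a linear neuron on [1, |x|², x1, ..., xN] can represent a sphere exactly. The weights are set by hand from a chosen center and radius (an assumption made for illustration; in the NN they are adapted), and the function names are hypothetical.

```python
def augment(x):
    """Map an input [x1, ..., xN] to [1, x1^2 + ... + xN^2, x1, ..., xN]."""
    return [1.0, sum(v * v for v in x)] + list(x)


def sphere_weights(center, radius):
    """Weights whose linear output equals radius^2 - |x - center|^2."""
    return ([radius ** 2 - sum(c * c for c in center), -1.0]
            + [2.0 * c for c in center])


def neuron_output(weights, x):
    """Linear output of one hidden neuron; positive means inside the sphere."""
    return sum(w * v for w, v in zip(weights, augment(x)))


w = sphere_weights(center=(1.0, 2.0), radius=2.0)
inside = neuron_output(w, (1.0, 2.0))      # at the center
on_edge = neuron_output(w, (3.0, 2.0))     # exactly on the circle
outside = neuron_output(w, (4.0, 2.0))     # one unit outside
```

The output is linear in the augmented vector, so each hypersphere can be designed with ordinary LDF methods, as the text notes.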

Figure 9 shows an example of the decision surfaces produced. We consider 3 classes of
data, N1 = 4 neurons and N2 = 6 hidden layer neurons. This NN produces N2 = 6 circles
(C1-C2 for class 1, etc.) that define 12 regions of space (R1-R3 for class 1, etc., and
R10-R12 in which no test data lies). The signs of the N2 = 6 neurons specify the
location of an input in R1-R12 as inside/outside each hypersphere. Figure 10 shows the
3 class boundaries defined by piecewise hyperspheres. Note that an input lying in
R10-R12 can be assigned no decision or classified as a reject class. If a WTA is used
at P2, then hard decisions (no reject class) result and the decision boundaries in
Figure 11 occur (with no reject class). These Figure 11 surfaces yield much better Pc
and the surfaces in Figure 10 yield much better Pe as we will show (Section 4.4).
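
The two ways of reading the hidden layer can be contrasted in a short sketch. The reject rule used here (reject unless the active neurons all belong to one class) is an assumption chosen to illustrate the idea; the text does not spell out the exact rule, and the names are hypothetical.

```python
def classify_threshold(outputs, neuron_class):
    """Class whose hyperspheres contain the input, or None (reject) when no
    neuron is positive or neurons of different classes are positive."""
    active = {neuron_class[i] for i, v in enumerate(outputs) if v > 0}
    return active.pop() if len(active) == 1 else None


def classify_wta(outputs, neuron_class):
    """Class of the most active hidden neuron (hard decision, no reject)."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return neuron_class[best]


neuron_class = [0, 0, 1, 1, 2, 2]                  # two hidden neurons per class
inside_one = [0.5, -1.0, -2.0, -0.3, -4.0, -1.0]   # inside a class-0 circle only
outside_all = [-0.5, -1.0, -2.0, -0.3, -4.0, -1.0] # outside every circle
```

Thresholding trades some correct decisions for rejects and so lowers the error rate, while WTA always decides and so raises the percent correct.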


FIGURE 9: Twelve regions R defined by 6 circles in the hyperspherical NN.


FIGURE 10: Three class (and reject class) decision regions from Figure 9.

4.3 PIECEWISE HYPERQUADRATIC NN [32]


This provides even more flexibility in the decision surfaces produced and allows
better Pc and Pe. This uses one additional input neuron, $x = [1\ x_1\ x_2]^T$,
complex-valued weights $\tilde{w}_{ij} = w_{ij} + j v_{ij}$ (these are easily produced
optically) and nonlinear intensity detection at P2 (as optics provides). For N1 = 3
input neurons and only N2 = 2 hidden layer neurons, the N2 = 2 outputs are

$$f_1(x) = |\tilde{w}_{11} x_1 + \tilde{w}_{12} x_2 + \tilde{w}_{13}|^2$$

$$f_2(x) = |\tilde{w}_{21} x_1 + \tilde{w}_{22} x_2 + \tilde{w}_{23}|^2$$

and the decision surface produced, $f_1(x) = f_2(x)$, is

$$a x_1^2 + b x_2^2 + c x_1 x_2 + d x_1 + e x_2 + f = 0.$$

As seen, each pair of P2 neurons in this NN can produce any quadratic surface
required. A line results (if a = b = c = 0), a circle or an ellipse (a ≠ b) results
(if c = d = e = 0), etc. These surfaces are exact (not approximate as in BP etc.). As
before, they are easily calculated as LDFs in the new space and are hyperquadratic (if
N1 > 2) in the original space (in which they are used). Figure 12 shows some of the
various surfaces possible (the NN algorithm and the data determine the complexity
required in each surface).
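
The link between the complex weights and the quadratic coefficients can be made explicit with a small sketch. It expands $|p x_1 + q x_2 + r|^2$ for real inputs and subtracts the two neuron outputs; the function name and the example weights are hypothetical.

```python
def quadratic_coeffs(w1, w2):
    """Coefficients (a, b, c, d, e, f) of the surface f1(x) - f2(x) = 0, where
    fk(x) = |wk[0]*x1 + wk[1]*x2 + wk[2]|^2 with complex weights and real x."""
    def expand(w):
        p, q, r = (complex(v) for v in w)
        return ((p * p.conjugate()).real,        # x1^2
                (q * q.conjugate()).real,        # x2^2
                2 * (p * q.conjugate()).real,    # x1*x2
                2 * (p * r.conjugate()).real,    # x1
                2 * (q * r.conjugate()).real,    # x2
                (r * r.conjugate()).real)        # constant
    return tuple(u - v for u, v in zip(expand(w1), expand(w2)))


# f1 = |x1 + j*x2|^2 = x1^2 + x2^2 and f2 = |2|^2 = 4: a circle of radius 2.
circle = quadratic_coeffs((1, 1j, 0), (0, 0, 2))
# f1 = x1^2 and f2 = (x1 + 1)^2: the quadratic terms cancel, leaving a line.
line = quadratic_coeffs((1, 0, 0), (1, 0, 1))
```

The imaginary parts of the weights are what allow the x1² and x2² terms to appear without a cross term, as in the circle example.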

4.4  INITIAL  RESULTS 

We show results for a real multiclass identification problem with severe distortions
present. We consider 3 classes of aircraft with severe (±80°) pitch and roll distortions.
Figure 13 shows several distorted inputs for one object class (DC-10). We used N1 = 34
neurons (32 wedge Fourier samples and 2 additional neurons) and N2 = 24 hidden layer
neurons. The decision surfaces produced were piecewise nonlinear combinations of 24
different hyperspheres (N2 = 24). The training set used consisted of 3267 distorted
inputs. The test set was 3072 other distorted inputs not present in the training set. The
test results (Table 1) are excellent. They show the flexibility of the NN to optimize Pc or
Pe for a given problem. If we desire large Pc, we use maximum selection (WTA neurons)
and achieve Pc = 98.9%. If we desire low Pe, we use thresholded (< 0) neurons and
achieve Pe = 0.4%. These results are most impressive considering the severe distortions
present and the few N1 and N2 neurons used.
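The trade-off between the two output rules can be sketched as follows. This is only an illustration under stated assumptions: each hidden neuron is assumed to report a margin that is negative when the input lies inside its region (matching the "(< 0)" threshold above), the strongest neuron is taken to be the one with the most negative margin, and the names `classify`, `margins` and `neuron_class` are hypothetical.

```python
def classify(margins, neuron_class, mode):
    """Return a class label, or None for the reject class.

    margins[i] < 0 is assumed to mean the input falls inside the region
    of hidden neuron i; neuron_class[i] is that neuron's class label.
    """
    if mode == "max":
        # Hard decision: the single strongest neuron always wins,
        # so no input is ever rejected.
        winner = min(range(len(margins)), key=lambda i: margins[i])
        return neuron_class[winner]
    if mode == "threshold":
        fired = {neuron_class[i] for i, m in enumerate(margins) if m < 0}
        # Reject when no neuron fires or neurons of different classes fire.
        return fired.pop() if len(fired) == 1 else None
    raise ValueError("mode must be 'max' or 'threshold'")
```

With thresholding, ambiguous inputs are rejected rather than misclassified, which lowers Pe at the cost of Pc; maximum selection always commits to a class.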
FIGURE 11: Class boundaries produced
by WTA neurons in Figure 9.




FIGURE  13:  Representative  DC-10 
distorted  input  images. 


FIGURE  12:  Decision  surfaces  produced  on  the  hyperquadratic  NN. 


                 THRESHOLD                        MAX SELECTION
                 Pc(%)    Reject (%)    Pe(%)     Pc(%)    Pe(%)
Training Set     93.9     5.7           0.4       98.9     1.1
Test Set         92.5     6.8           0.7       98.1     1.9

Table 1: Test results of the hyperspherical NNs for high Pc (max selection WTA
neurons) and low Pe (thresholded (< 0) neurons).
For completeness, we review some of our HK NN results [28]. The storage N3/N1 =
(number of classes)/(number of input neurons) for which Pc > 95% can be quantified for
various associative processors. The Hopfield NN (using a correlation matrix) yields poor
N3 = 0.12 N1 storage (N3 << N1), the pseudoinverse solution (using standard linear
algebra, not NNs) is much better (N3 = 1.04 N1 or N3 ≈ N1) and the HK NN is best (N3 =
1.52 N1). When input noise is present (σ = 0.1), all NNs degrade. We ignore the Hopfield
NN since its performance is too poor (and no variations of it can sufficiently improve
results). For Pc > 95%, the standard HK NN can now only store N3 = 0.75 N1 (this is still
better than the linear algebra pseudoinverse solution), while the robust HK NN is better
(N3 = 0.8 N1). However, better results (equal to the noise free ones) with N3 > N1 are
obtained with L-max (L = 2) encoding where N3 = 1.5 N1 (no loss) is obtained. Our
analog accuracy tests are also most impressive and show that analog accuracy NNs can
achieve better performance than infinite accuracy NNs when trained on limited accuracy
data using the robust HK NN. In the HK NN synthesis, we now use $\sigma_{syn} = 2^{-B}/12^{1/2}$ for a
B-bit accuracy NN. With infinite accuracy, we obtained Pc = 82.3% and the same
performance when the inputs etc. to this NN were quantized to 8 bits. However, our
robust HK NN with $\sigma_{syn}$ and with training on 8 bit data yielded much better (90.6%)
results.


TEST RESULTS

TEST SET 1 = ±50° DISTORTIONS (NT = 882)
TEST SET 2 = ±60° DISTORTIONS (NT = 1250)

NEURAL NET      TEST-SET 1          TEST-SET 2
USED            TRAIN     TEST      TRAIN     TEST
STD HK          93.2      92.3      82.3      88.5
ROBUST HK       95.1      95.4      90.6      91.2

Table 2: HK NN tests.


Very significant multiclass (3 different aircraft) recognition results with severely distorted
inputs (±50° and ±60° distortions in pitch and roll) were obtained. The NN used N1 = 33
input neurons and only N2 = N3 = 2 neurons. Table 2 shows the test results obtained. As
seen, the robust HK NN provided 2-8% better results. These very impressive results are
highlighted in Table 3.


3-D DISTORTION-INVARIANT MULTICLASS IDENTIFICATION
33 INPUT NEURONS, 2 HIDDEN LAYER NEURONS
RECOGNIZE Pc > 95% OF OVER N3 = 2000 INPUTS > 70 N1
IN NOISE (σ = 0.1) WITH 6 BIT (1%) ACCURACY NN

Table 3: Summary of the very impressive accuracy and noise results of Table 2.
Abstract

A  neural  network  pattern  classifier  is  presented.  Its  decision  boundaries  are  formed 
from  segments  of  conic  sections  which  allows  it  to  achieve  improved  performance  over 
piecewise  linear  neural  network  classifiers,  such  as  our  earlier  adaptive  clustering  neural 
network (ACNN). We discuss an optical realization that uses complex-valued weights, optical
intensity detectors, and an additional input neuron to achieve piecewise conic decision
surfaces  (rather  than  the  piecewise  linear  surfaces  that  the  ACNN  produces). 


1  Introduction 

Neural  networks  (NNs)  have  the  ability  to  produce  arbitrarily  complex  decision  boundaries 
[1]  in  an  organized  and  efficient  manner.  This  makes  them  very  attractive  for  difficult  multiclass 
classification  problems.  In  this  paper,  we  extend  our  earlier  ACNN  algorithm  [2]  (which 
provided  piecewise  linear  discriminant  surfaces)  to  more  complex  piecewise  quadratic  decision 
surfaces  (thereby  improving  recognition  percentage  Pc). 

The ACNN [2] has several attractive properties. For example, it requires few ad hoc
parameters to be selected. It uses pattern recognition and linear discriminant function (LDF)
techniques  to  select  initial  weights.  It  then  uses  neural  network  optimization  techniques  to 
refine  the  initial  weights  to  produce  combinations  of  LDFs,  forming  piecewise  linear  decision 
surfaces,  and  it  converges  much  faster  than  the  standard  benchmark,  the  backpropagation 
training  algorithm.  We  now  improve  upon  the  ACNN  with  the  piecewise  quadratic  neural 
network  (PQNN). 

Many researchers have noted the parallelism and interconnection advantages of optical NNs.
We address many new and practical issues associated with optical architectures. We employ
a feature space neuron representation to utilize a reduced number of input neurons.
Optical NNs [3,4] can conceptually implement the multilayer perceptron [5] neural network
architecture (which can produce complex decision surfaces), but these require advanced optical
materials and devices. We consider an optical implementation of the PQNN in which the use of
an extra input neuron, complex-valued weights, and intensity detectors provides piecewise conic
surfaces. We use multilevel phase error diffusion [6] to produce high accuracy complex-valued
interconnection weights. The use of complex-valued weights is easily possible in optics but has
not been used in optical NNs. Most optical NNs also do not use the intensity (magnitude
squared) nature of optical detectors for their nonlinearity advantages. The resultant PQNN
thus makes use of specific advantages of optics. Because of the complex-valued weights used
in the PQNN, it is directly suited for an optical implementation.
2 Architecture

Fig. 1 shows the basic three layer NN architecture used. The N1 = N + 1 input P1 neurons
are analog (the ability to handle analog data directly is another attractive feature of an optical
or analog VLSI NN). The first N neurons at P1 are a feature space description of the multiclass
input data to be classified. The input to the additional (N + 1)-th neuron is always equal to
unity. This neuron allows the NN to adjust the center of each hyperquadratic discriminant
function. Our feature space inputs are generally unipolar. If other bipolar input representation
spaces are used, we use an input P1 neuron nonlinearity to produce unipolar input neurons

$$x_i \Leftarrow \frac{1 + \tanh(x_i)}{2} \qquad (1)$$

where the monotonic sigmoid nonlinearity maintains the ordering of the feature space data.
The input-to-hidden layer (P1-to-P2) weights are complex-valued. A winner take all (WTA)
operation is applied to the N2 hidden layer neuron outputs. The P2 neurons are intensity
sensitive with outputs

$$f_i(\mathbf{x}) = \Big\| \sum_j (w_{ij} + j v_{ij})\, x_j \Big\|^2 \qquad (2)$$

where j denotes the input P1 neuron index, i is the P2 hidden layer neuron index, and $w_{ij} + j v_{ij}$
are the complex-valued P1-to-P2 weights. We determine a number of prototypes from the
training set of data and from these determine the number of hidden layer neurons to be used.
From the locations of the prototypes (in our feature space), we select initial quadratic decision
boundaries (by looking at sets of prototypes). These boundaries define initial complex-valued
P1-to-P2 weights and hence initial probability density clusters (in feature space) for each P2
neuron. These initial weights are then adapted in our NN training algorithm (Section 4).
When training is complete, classification is performed on the test data. In testing, the most
active P2 neuron denotes the cluster to which the input P1 data to be classified belongs (we
allow more than one P2 cluster or prototype per class). The P2 outputs are then mapped with
binary weights to the C output P3 neurons (one per class, where there are C classes in our
multiclass problem).
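The recall path described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `to_unipolar`, `pqnn_forward`, `weights` and `neuron_class` are assumptions, and the binary P2-to-P3 mapping is represented simply as a class index per hidden neuron.

```python
import math

def to_unipolar(x):
    """Map a bipolar feature to (0, 1) with the monotonic tanh sigmoid."""
    return (1.0 + math.tanh(x)) / 2.0

def pqnn_forward(features, weights, neuron_class):
    """Classify one feature vector.

    features     : N real, unipolar feature values
    weights      : one row of N + 1 complex weights per hidden neuron
    neuron_class : class index assigned to each hidden neuron
    """
    x = list(features) + [1.0]                 # extra unity input neuron
    # Intensity detection: squared magnitude of the complex weighted sum.
    outputs = [abs(sum(w * xj for w, xj in zip(row, x))) ** 2
               for row in weights]
    # Winner take all: the most active hidden neuron names the cluster.
    winner = max(range(len(outputs)), key=outputs.__getitem__)
    return neuron_class[winner], outputs
```

Because several hidden neurons may share a class index in `neuron_class`, more than one cluster per class is supported without any change to the code.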

3  Two  Dimensional  Quadratic  Surfaces 

To show that this architecture produces quadratic decision surfaces, we consider the simplified
NN architecture of Fig. 2. This shows a two class problem (two output P3 neurons)
with two hidden layer P2 neurons and a two dimensional feature space (2 + 1 = 3 input P1
neurons). The notation used is shown in Fig. 2. The P1-to-P2 weights are complex-valued and
are denoted by

$$\tilde{w}_{ij} = w_{ij} + j v_{ij}, \qquad (3)$$

where we constrain $\|\tilde{w}_{ij}\| < 1.0$. Practical optical weights must satisfy this constraint as they
are passive. This requirement on the weights can be achieved without loss of generality by
calculating the optimal weights for classification and then applying this constraint (as a scale
factor). The outputs from the two hidden layer neurons, denoted by $f_1(\mathbf{x})$ and $f_2(\mathbf{x})$, where $\mathbf{x}$
denotes the input P1 neuron values, are

$$f_1(\mathbf{x}) = \|\tilde{w}_{11} x_1 + \tilde{w}_{12} x_2 + \tilde{w}_{13}\|^2 \qquad (4)$$
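$$f_2(\mathbf{x}) = \|\tilde{w}_{21} x_1 + \tilde{w}_{22} x_2 + \tilde{w}_{23}\|^2 \qquad (5)$$

Rewriting the complex-valued weight $\tilde{w}_{ij}$ as in (3), the hidden layer neuron outputs become

$$\begin{aligned} f_1(\mathbf{x}) = {} & x_1^2 (w_{11}^2 + v_{11}^2) + x_2^2 (w_{12}^2 + v_{12}^2) + 2 x_1 x_2 (w_{11} w_{12} + v_{11} v_{12}) \\ & + 2 x_1 (w_{11} w_{13} + v_{11} v_{13}) + 2 x_2 (w_{12} w_{13} + v_{12} v_{13}) + (w_{13}^2 + v_{13}^2) \end{aligned} \qquad (6)$$

$$\begin{aligned} f_2(\mathbf{x}) = {} & x_1^2 (w_{21}^2 + v_{21}^2) + x_2^2 (w_{22}^2 + v_{22}^2) + 2 x_1 x_2 (w_{21} w_{22} + v_{21} v_{22}) \\ & + 2 x_1 (w_{21} w_{23} + v_{21} v_{23}) + 2 x_2 (w_{22} w_{23} + v_{22} v_{23}) + (w_{23}^2 + v_{23}^2) \end{aligned} \qquad (7)$$

Figure 1: Piecewise Quadratic Neural Network

Figure 2: 2-class, 2-feature PQNN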

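To describe the discriminant surface (discriminant function) between classes 1 and 2, we equate
$f_1(\mathbf{x})$ and $f_2(\mathbf{x})$ and obtain

$$\begin{aligned} & x_1^2 (w_{11}^2 + v_{11}^2 - w_{21}^2 - v_{21}^2) + x_2^2 (w_{12}^2 + v_{12}^2 - w_{22}^2 - v_{22}^2) \\ & + 2 x_1 x_2 (w_{11} w_{12} + v_{11} v_{12} - w_{21} w_{22} - v_{21} v_{22}) \\ & + 2 x_1 (w_{11} w_{13} + v_{11} v_{13} - w_{21} w_{23} - v_{21} v_{23}) \\ & + 2 x_2 (w_{12} w_{13} + v_{12} v_{13} - w_{22} w_{23} - v_{22} v_{23}) \\ & + (w_{13}^2 + v_{13}^2 - w_{23}^2 - v_{23}^2) = 0 \end{aligned} \qquad (8)$$

The general form for the decision surfaces in (8) is

$$a x_1^2 + b x_2^2 + c x_1 x_2 + d x_1 + e x_2 + f = 0 \qquad (9)$$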

With different values for the a, b, c, d, e, and f coefficients, different surfaces can be produced.
For example, if c = d = e = 0 and a = b, we obtain a circle with radius $\sqrt{-f/a}$; if c = d = e = 0
and a ≠ b, we obtain an ellipse; if a = b = c = 0, we obtain a straight line, given by
$x_1 = -(e/d)\,x_2 - (f/d)$. Similarly, we can select the coefficients to produce a parabolic or
hyperbolic surface.

Figure 3: Possible decision surfaces for the two class, two-feature problem

As Eqs. 8 and 9 show, we can produce any general second order discriminant surface through
various choices of the interconnection weights. When more than 2 features are used, these become
multidimensional surfaces and decision surfaces become piecewise hyperquadratic surfaces. If
a given problem only requires a hyperplane, the weights automatically accommodate this (by
choosing the appropriate real and imaginary parts of the weights to be zero). Therefore this
architecture automatically accommodates any lower order surface. Fig. 3 shows the five basic
decision surfaces possible with the PQNN. As the specific weight values change, the parameters
and locations of the various quadratic surfaces change.
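The expansion leading to Eqs. 8 and 9 can be checked numerically. The sketch below, with the hypothetical helper names `surface_coefficients` and `surface_value`, computes the six coefficients of Eq. 9 from the complex weights of the two hidden neurons and evaluates the resulting polynomial, so it can be compared against the directly computed difference of the two intensity outputs. Note that the factor of 2 in the cross terms of Eq. 8 is absorbed into c, d and e here.

```python
def surface_coefficients(w1, w2):
    """Coefficients (a, b, c, d, e, f) of f1(x) - f2(x) = 0.

    w1, w2 : the three complex weights (w_i1, w_i2, w_i3) of each
             hidden neuron in the two-feature network.
    """
    def expand(w):
        p, q, r = w
        # |p*x1 + q*x2 + r|^2 for real x1, x2, written term by term;
        # Re(p * conj(q)) equals w_p*w_q + v_p*v_q.
        return (abs(p) ** 2, abs(q) ** 2,
                2 * (p * q.conjugate()).real,
                2 * (p * r.conjugate()).real,
                2 * (q * r.conjugate()).real,
                abs(r) ** 2)
    return tuple(u - v for u, v in zip(expand(w1), expand(w2)))

def surface_value(coeffs, x1, x2):
    """Evaluate a*x1^2 + b*x2^2 + c*x1*x2 + d*x1 + e*x2 + f."""
    a, b, c, d, e, f = coeffs
    return a * x1 * x1 + b * x2 * x2 + c * x1 * x2 + d * x1 + e * x2 + f
```

As a usage example, the weights `(0j, 0j, 1+0j)` and `(1+0j, 1j, 0j)` give a = b = -1, c = d = e = 0 and f = 1, which is the unit circle.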

4  Neural  Network  Algorithm 

In the synthesis of the weights during training, we select prototypes of each class. We then
select initial quadratic decision boundaries between sets of prototypes. These determine our
initial complex-valued P1-to-P2 weights. We then use NN techniques to adapt these weights.
We do this by determining the most active true and false class P2 neurons and adapting their
weights (after each presentation of all training samples) using gradient descent optimization.
This is attractive as it uses pattern recognition techniques to select the initial weights (these
are a much better choice and closer to the optimal weights than the typical choice of a random
set of weights). The update algorithm then allows the NN to combine individual quadratic
decision surfaces into the final piecewise quadratic decision surfaces used in classification.

We allow more than one prototype per class. The prototypes chosen implicitly contain the
class distribution information (the class distributions are not explicitly calculated, but are used
implicitly). The initial weights are not the prototype locations in feature space (as they were
in ACNN).
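The weight adaptation step can be illustrated with a small gradient sketch. The criterion function is not given in this section, so the code below assumes the simplest possible goal (raise the most active true class output, lower the most active false class output) and applies one step per sample rather than one step per pass through the training set. The names `neuron_output`, `adapt_pair` and `rate` are hypothetical.

```python
def neuron_output(w, x):
    """Intensity output |sum_j w_j x_j|^2 for complex w and real x."""
    return abs(sum(wj * xj for wj, xj in zip(w, x))) ** 2

def adapt_pair(w_true, w_false, x, rate=0.05):
    """One gradient step for a (true class, false class) neuron pair.

    x already includes the unity input. Treating the real and imaginary
    parts of each weight as independent variables, the gradient of |z|^2
    with respect to w_j is 2 * z * x_j, where z is the weighted sum.
    """
    z_t = sum(wj * xj for wj, xj in zip(w_true, x))
    z_f = sum(wj * xj for wj, xj in zip(w_false, x))
    # Ascend on the true class output, descend on the false class output.
    new_true = [wj + rate * 2 * z_t * xj for wj, xj in zip(w_true, x)]
    new_false = [wj - rate * 2 * z_f * xj for wj, xj in zip(w_false, x)]
    return new_true, new_false
```

A practical version would also rescale the weights afterwards so that their magnitudes stay within the passive optical limit; that step is omitted here.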

5  Case  Study 

To demonstrate the performance of our PQNN architecture and algorithm and to allow its
decision surfaces to be visualized, we consider a 2-class, 2-feature example (using the NN in
Fig. 2). As our case study, we generated 500 samples of each of two classes. Fig. 4 shows 100 of
the samples in each class. Discrimination of these two classes clearly requires nonlinear decision
boundaries. We used only two hidden layer neurons (one per class) for this example, but more
than one per class can be used if necessary. The prototypes defined the initial P1-to-P2 weights.
These were adapted in 50 iterations. Fig. 5 shows the resulting decision surface. It achieves
Pc = 97.10% classification accuracy. In this case the decision surface is a second order surface
(a circle). This is a true circle and not a piecewise approximation to it (as is produced with
the ACNN and other NNs). The algorithm can produce any quadratic (second order) decision
surface (e.g., circle, ellipse, parabola, hyperbola, etc.) as well as lower order (linear) decision
surfaces (lines in two dimensional feature space and hyperplanes in higher dimensional feature
spaces). The choice is automatic in the algorithm (as it selects the appropriate non-zero real
and imaginary parts of each interconnection weight). For comparison, the ACNN was tested
with different numbers of P2 neurons. With 12 neurons, piecewise linear decision surfaces were
produced and this gave a classification accuracy of 96.70%. The decision surface produced is also
shown in Fig. 5. The number of P2 neurons required in the ACNN is not easily determined
except by extensive tests and the minimum number is quite critical. The PQNN does not have
this disadvantage. Other NNs can produce similar decision surfaces at the expense of many
more P2 neurons and a larger interconnection mask. Also, selecting the number of P2 neurons
can be difficult. Clearly, the PQNN can produce exact quadratic decision surfaces (rather than
piecewise ones) and can achieve this much more easily and automatically.

Figure 4: 2-class, 2-feature case study data ('+' = Class 1, '*' = Class 2)

Figure 5: Decision boundaries formed by the PQNN and ACNN

6  Discussion 

A piecewise quadratic neural network (PQNN) was described and demonstrated. It achieves
piecewise quadratic decision surfaces by the use of complex-valued interconnection weights
and intensity detectors. An optical realization architecture was described and initial results
presented.

The PQNN can be viewed as a higher order NN. However, it is very different from the
Giles, et al. [7] higher order NN, which used higher order neurons ($x_i x_j$, etc.) to achieve
shift-invariance at the cost of $N^2$ interconnections (versus our use of complex-valued weights
and square law detectors), and the Psaltis [8] higher order NN which used "higher order" to
refer to 3-D optical volume holographic interconnections. The Neocognitron [9] uses many
neuron layers to achieve shift-invariance and distortion-invariance (a form of higher order NN).
We use a feature space neuron representation to achieve shift and distortion-invariance
(with far fewer neurons and interconnections). Our higher order NN is intended to produce
more complicated decision surfaces (using only 2 or 3 neuron layers and simple matrix-vector
operations, rather than volume holograms or multiple neuron layers).
Abstract

Various error sources (including analog accuracy, nonlinearities, and noise) are present in all neural nets. We
consider their effects in training and testing on two different pattern recognition neural nets. We show that the
neural nets considered allow some such effects to be included inherently in the neural net synthesis algorithm and
that the effect of the other error sources can be "trained out" by proper selection of neural net design parameters.
We consider multiclass distortion-invariant pattern recognition neural nets. Our results are applicable to analog
VLSI and optical neural nets.

1.  Introduction 

For the difficult pattern recognition problems considered, the neural nets we used are reviewed in Section 2
together with the error sources considered and the algorithmic and training techniques considered to overcome these
effects. Test results are included in Sections 3 and 4 and our conclusions are advanced in Section 5. We consider
good probability of correct recognition (Pc%) and large storage capacity (handling many classes M of objects with
few neurons N1) for distorted objects to be necessary performance goals.

2. Pattern Recognition Neural Nets

The three layer neural net (NN) architecture of Figure 1 is considered with N1 input P1 neurons, N3 hidden layer
P3 neurons and N4 = C (the number of object classes) output P4 neurons. The N1 input neurons are a feature space
representation (wedge sampled magnitude Fourier samples) to reduce dimensionality, training set size, and training
time [1]. The number of neurons N3 is selected by clustering techniques as prototype exemplars. These define
initial P1-to-P3 weights which are then adapted in training. An important aspect of this NN is the use of a perceptron
criterion (error) function with a parameter S (Figure 2). Selection of S achieves generalization, avoids overtraining,
and is of use when noise and error effects are considered. Once the weights have been designed, this is a single-pass
feed-forward NN in classification (recall) with no iterations used. This adaptive clustering NN (ACNN) has been
detailed earlier [2].


The P1-to-P3 operations are a matrix-vector multiplication implemented by a matrix M at plane P2 operating on
an input P1 neuron vector x to yield the P3 vector y of neurons. We can address the storage capacity of the system
using associative processor (AP) terminology where the P1 inputs are keys, the P3 vector (length K) is a recollection
and the storage M is the number of key/recollection vector pairs stored. If each x input is assumed to be associated
with a different y output at P3, then M = 2N1 is the largest storage possible and this can be achieved only with a
Ho-Kashyap (HK) algorithm to compute the weights M as detailed elsewhere [3]. A robust HK algorithm [3] is of
particular use in our present accuracy, noise and error source considerations. This uses

$$\mathbf{M} = \mathbf{Y}\mathbf{X}^T (\mathbf{X}\mathbf{X}^T + M\sigma^2 \mathbf{I})^{-1}$$

as the initial set of matrix weights in the synthesis algorithm where X and Y are matrices with all x and y vectors as
their columns. When the parameter σ is varied, this algorithm has been shown to provide best performance for
input noise with variance σ². We have also related this σ value to the analog accuracy $2^{-B}$ of a B-bit neural net as

$$\sigma = 2^{-B}/(12)^{1/2} \qquad (1)$$

as we will show. Thus, this technique has a sound theoretical basis [3].
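One way to read Eq. (1) is as the standard deviation of the rounding error of a uniform quantizer with step size 2^-B, which is the step divided by the square root of 12. The sketch below checks this numerically under that assumption, for data scaled to the range [0, 1). The function names and the choice of uniformly distributed test data are illustrative only.

```python
import math
import random

def sigma_for_bits(bits):
    """Standard deviation 2^-B / sqrt(12) for a B-bit system."""
    return 2.0 ** (-bits) / math.sqrt(12.0)

def quantize(x, bits):
    """Round x to the nearest multiple of the step size 2^-B."""
    step = 2.0 ** (-bits)
    return round(x / step) * step

def empirical_error_std(bits, samples=20000, seed=0):
    """Measured standard deviation of the rounding error on uniform data."""
    rng = random.Random(seed)
    values = [rng.random() for _ in range(samples)]
    errors = [v - quantize(v, bits) for v in values]
    mean = sum(errors) / samples
    return math.sqrt(sum((e - mean) ** 2 for e in errors) / samples)
```

For B = 6, `sigma_for_bits(6)` evaluates to about 0.0045, and `empirical_error_std(6)` should agree with it to within sampling error.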

The error sources we consider are listed in Table 1 with respect to the data plane P1-to-P3 they affect. Some
entries require discussion. We consider a piecewise nonlinear error model for the input neurons (versus the ideal
linear function) and we consider minimum (non-zero) off levels for P1 and P2 (due to light leakage in an optical
system, etc.). The 6 bit input neuron accuracy assumed is typical of a 1% analog system. The P2 accuracy in the
neuron weights is B = 6 bits in standard analog VLSI (the same as for the P1 input neurons) but can easily be higher
(12 bits) in an optical system (since a fixed film mask with accurate encoding can be used). Advanced analog VLSI
techniques can provide 10 bit P1 and P2 accuracies (at an increase in cost) while 6 bits is typical of standard
fabrication methods. Error sources we found to be negligible are noted by an asterisk.

3.  Ho-Kashyap  Storage  Capacity  M/N  Test  Results 

For our H-K algorithm tests, we use N1 = N = 16 input neurons with random analog values and N3 = K = 8
binary neurons at P3. The maximum possible storage (M) for any NN is M = 2N or 32 for our case. Thus, we now
consider the performance Pc% of H-K NNs that give M > N storage for various designs (σ values) with different
error sources and bit accuracies present.

We first show in Table 2 that use of σ in our robust NN provides better Pc% results. In Test 1, the standard NN
with the weights calculated with infinite accuracy data gave Pc = 82.3% correct classifications of input data when
tested with a large σ = 0.1 amount of noise added to the input neuron values during tests. When the input data was
quantized to 6 bits, noise tests gave similar 83% results. However, significantly better noise test results occurred
(Pc ≈ 91%) when the NN weights were synthesized with σ = 0.1 using our robust algorithm. Thus, our robust
algorithm provides better results in noise and with limited accuracy. Much better Pc occurs (Tests 2-4) when the
NN is tested with only limited accuracy inputs and weights (6-14 bits) without the large σ = 0.1 amount of input
noise present and with σ = 0.0045 (for 6 bits), σ = 0.00028 (for 10 bits), etc. calculated from (1) used in synthesis of
the NN. Thus, our robust NN algorithm gives excellent optimum results for limited input and weight accuracies with
M > N (Pc% degrades as storage M increases as expected). We note (Tests 2-4) that no improvement in
performance results as the number of bits of accuracy B was increased from 10 to 14 bits.

We then ran a number of NNs with different combinations of the error sources in Table 1 present and found that
the major error sources were the input and weight accuracies and the nonlinear input neuron curve, with the
nonlinear input neuron curve being the major error source (as is expected with analog input neurons). We then
analyzed various hidden layer P3 encoding schemes to determine which gave the best Pc when the nonlinear input
neuron errors were present. We found that L-max hidden layer neuron encoding [4] was best (in L-max encoding,
the L = 2 most active P3 neurons are found by a WTA and which two are activated denotes the output P4 class
neuron activated). For M = 16 stored vector pairs (M/N = 1) we achieved excellent Pc = 98.8% results and a lower
Pc = 90.5% with more storage (M = 20 or M/N = 1.375). We then quantified the input and weight accuracy required
for storage of M = 18 vector pairs. We found that B > 4 bit input accuracy and B > 8 or 10 bit weight accuracy
was sufficient. When nonlinear P1 neuron errors were present, 8 bit weights sufficed and without nonlinear P1
errors no improvement occurred for ≥ 10 bit weights and Pc approached 100%. Thus, NNs require more weight
accuracy than input accuracy and hence optical NNs have an advantage over analog VLSI implementations.
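The L-max decoding step described above can be sketched as follows. The text states only that the identity of the L = 2 most active hidden neurons denotes the class; the lexicographic assignment of neuron pairs to classes used here, and the names `build_lmax_code` and `lmax_decode`, are assumptions made for illustration.

```python
from itertools import combinations

def build_lmax_code(num_hidden, num_classes, L=2):
    """Assign each class a distinct set of L hidden neurons."""
    subsets = list(combinations(range(num_hidden), L))
    if num_classes > len(subsets):
        raise ValueError("not enough distinct neuron subsets")
    return {frozenset(s): c for c, s in enumerate(subsets[:num_classes])}

def lmax_decode(outputs, code, L=2):
    """Return the class whose code matches the L most active neurons,
    or None when that subset was not assigned to any class."""
    order = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    return code.get(frozenset(order[:L]))
```

With K = 8 hidden neurons and L = 2 there are 28 distinct pairs, so up to 28 classes can be encoded this way.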

4. ADAPTIVE CLUSTERING NN TEST RESULTS

We then tested the ACNN design in Figure 1 for a multiclass distortion invariant pattern recognition problem
involving 3 classes of aircraft with severe ±90° distortions in roll, pitch and yaw. The NN was trained on 1890
distorted inputs and tested on 1734 distorted inputs not present in the training set. We used N1 = 16 input neurons,
N4 = C = 3 output neurons (the number of object classes) and generally N3 = 8 hidden layer neurons. We varied the
NN parameter S (Figure 2) and the training procedure to achieve the best Pc% when the major error sources were
present (input and weight accuracies and nonlinear input neurons). In our data, we also note the minimum distance
separation S' between the cluster centers of the N3 prototype hidden layer neurons and how this affects the
parameter S used in NN synthesis.

We first calculated the weights using a full 32-bit accuracy digital processor and then tested the NN with full
accuracy inputs and weights, with only the inputs quantized to different numbers of bits, and with only the weights
quantized. Similar results occurred for both cases and for the training and test data. The S = 0.063 data in
Table 3 shows averaged results that indicate that 4 bit input and weight accuracies give little degradation in Pc%. In
these data, we used S > S' = 0.030 since one might assume that a larger S forced separation would improve results.
This is not the case since the S = S' = 0.030 data in Table 3 shows better results. We attribute this to the fact
that a smaller S value allows decision boundaries to be closer to data clusters and that quantizing input data (after
the weights have been calculated with high accuracy data) does not significantly move the data from the original
clusters. Thus, we select S = S' in our NN design.

In Table 3, the performance is excellent, but we still observe a loss in Pc% as the input accuracy in testing is
reduced (Pc% decreases from 94.1% to 91.6% as P1 decreases from 10 to 4 bits). To obtain improved results, we
trained the neural net on quantized inputs. We also calculated new N3 prototype neuron inputs and a new S = S'
value with these quantized inputs (in other words, full training was done on the low accuracy neural net to be used in
testing). Table 4 shows our results using inputs and weights with only 4 bits of accuracy. The new S = S' value was
0.070. As seen, the system with full accuracy gave Pc = 91.3%. When these calculated weights were quantized to 4
bits performance degraded by 2.8% to Pc = 88.7%. However, when training was performed on the low accuracy 4
bit system we obtained even better performance (Pc = 95.3%) than in the original system. Similar trends were
obtained in all cases tested. Clearly, this NN allows NN accuracy effects to be trained out.

Similar results (Table 5) were obtained when input neuron nonlinearity (NL) errors were considered. Tests with
this error source gave poor (Pc = 77.3%) results while training the NN with this error source present resulted in
greatly improved (Pc = 93.7%) test results. Thus, nonlinear error sources can also be trained out.

With a low accuracy NN, we expect more hidden layer neurons N3 to improve Pc% (more prototypes or data
clusters are preferable since their separation decreases with accuracy) but if N3 is made too large, local minima and
increased training time result and S' decreases. Table 6 shows that improved Pc% results as N3 is increased from 8
to 20 for a 32 bit accurate system (Tests 1 and 2) and for a 6-bit accurate system (Tests 3 and 4) with S = S' used in
all cases. Thus, increasing the number of hidden layer neurons improves the performance of low accuracy NNs.
Nonlinear transfer function
Accuracy (6 bits)

* 

Pi 

Off  (zero)  minimum  level 

P2

Off  (zero)  minimum  level 

Pa 

Accuracy  (6-12  bits) 

* 

Pa 

Nonuniform  beam  collimation 

* 

Pa 

Spatial  (bias)  and  temporal  (shot)  noise 

TABLE 1: Optical System Error Sources


TEST   B-BIT ACCURACY   Pc% (STD. NN)   Pc% (ROBUST NN)   REMARKS
1      ∞ bits           82.3            90.6              High σ = 0.1 noise
1      6 bits           83.0            91.3

TEST   B-BIT ACCURACY   M/N     Pc%    REMARKS
2      6 bits           1       98.3   σ equal to bit accuracy
2      10-14 bits       1       100
3      6 bits           1.125   98.6
3      10-14 bits       1.125   99.4
4      6 bits           1.375   98.0
4      10-14 bits       1.375   99.3

TABLE 2: Tests showing σ improves Pc for a 6-bit value with high σ = 0.1 noise
(Test 1) and with σ = 2^-B/(12)^(1/2) (Tests 2-4)
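
The expression σ = 2^-B/(12)^(1/2) in the Table 2 caption is the standard deviation of the rounding error of a uniform quantizer with step 2^-B. The short check below is a sketch under the assumption that values are uniformly distributed over [0, 1) and rounded to the nearest level; the variable names are illustrative and not from the report.

```python
import numpy as np

B = 6                                   # bit accuracy used in Tests 2-4
step = 2.0 ** -B                        # quantization step for values in [0, 1)
sigma_formula = step / np.sqrt(12.0)    # sigma = 2^-B / sqrt(12)

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=200_000)
err = np.round(x / step) * step - x     # rounding error of the assumed quantizer
sigma_measured = err.std()              # should be close to sigma_formula
```

For B = 6 the formula gives σ of roughly 0.0045, and the measured value agrees to within sampling error.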


INPUT BITS   Pc% (S = 0.030)   Pc% (S = 0.063)
?            94.12             84.72
?            94.12             84.37
?            92.73             84.36
?            91.58             83.33
?            83.79             76.33
?            63.73             53.06
1            57.84             55.02

TABLE 3: Selecting S = S' = 0.030 yields better Pc with input P1 quantization


                 Full Accuracy        Quantize Full Accuracy   Train on
                 Training (32 Bits)   Weights to 4 Bits        4 Bit Inputs
Pc% (training)   91.3%                88.6%                    95.1%
Pc% (testing)    91.6%                88.8%                    95.4%

TABLE 4: Improve Pc by training on a low accuracy NN


Scenario              P'c% (testing)
Test with NL input    77.2%
Train with NL input   93.7%

TABLE 5: Train-Out Input Neuron Nonlinearities to Improve Pc%


5.  SUMMARY 

We have shown that two advanced pattern recognition neural nets achieve excellent pattern recognition
performance with various analog VLSI and optical neural net error sources. The HK neural net allows the best
performance and storage in noise and with a limited accuracy NN. WTA hidden layer neurons are preferable when
nonlinearity errors are present. We can train out limited neuron and weight accuracies and analog neuron
nonlinearities. The advanced neural nets noted and proper use of their synthesis parameters (σ, S, number of hidden
neurons equal to number of data clusters) allow these attractive error source properties.
CHAPTER 7: SIMULATION AND OPTICAL LABORATORY NEURAL NET
RESULTS

7.1  ERROR-FREE  SYSTEM  TESTS 

We conducted two error-free synthetic case studies and an aircraft case study comparing our
new PQNN to other classifiers. Our first case study was a 2-class synthetic data set with 3 input
neurons (features), with 500 samples in class 1 and 500 in class 2. We tested various classifiers
with different numbers of hidden layer neurons N3. Table 1 lists our results. We see that all
classifiers (except exemplars) perform well when N3 is chosen large enough (and with the proper
S value). The key result is that our PQNN performs best or nearly best for any N3 choice and
especially for smaller N3 values.


               Pc (%)
N3   S         Exemplars   ACNN   Gaussian   BP     PQNN
?    0.25179   56.2        62.8   99.2       80.3   99.9
?    0.13190   66.5        62.4   95.1       80.7   96.9
?    0.04171   60.1        80.6   95.4       99.6   98.5
?    0.04171   68.9        80.8   97.8       99.5   98.8
?    0.01042   68.4        93.4   97.8       99.8   99.5
?    0.01042   66.3        93.3   97.8       99.5   99.1
8    0.01042   69.6        96.5   98.1       99.6   98.7
?    0.01042   77.3        98.8   97.9       99.7   98.6
?    0.00502   80.5        98.9   98.3       99.5   99.2

Table 1: Pc vs. N3 (2 Class Case Study)


Figure 1 shows the original 2 class data. Figure 2 shows the different decision surfaces
produced with the different classifiers with N3 = 2. Clearly, the PQNN performs best (the Gaussian
classifier performs similarly, since the data is Gaussian distributed).

Table 2 shows similar results for our second synthetic data set. This involved a 4-class problem
(with 5000 total samples; 1020, 135, 1439 and 2406 samples in the four classes). We use N1 = 3 input
neurons (two features). The PQNN performs best for any N3 and its advantage in Pc is larger for
smaller N3. Figure 3 shows the original data and Figure 4 shows the decision boundaries produced
with N3 = 5 hidden layer neurons.


               Pc (%)
N3   S         Exemplars   ACNN   Gaussian   BP     PQNN
?    0.04503   43.1        74.9   85.9       92.4   96.8
?    0.04503   51.5        74.7   85.7       94.6   95.2
?    0.03154   62.2        75.0   83.5       94.8   96.4
?    0.02380   67.8        75.7   91.0       94.5   95.2
?    0.02380   69.4        94.2   91.3       96.0   96.1
?    0.00935   69.9        94.3   92.0       94.8   96.7
?    0.00935   69.8        94.4   90.6       95.2   96.8
?    0.00935   72.5        94.4   88.4       95.6   94.9
?    0.00032   71.9        94.8   91.5       95.8   95.3
?    0.00032   72.3        93.9   93.7       95.4   96.9
?    0.00032   73.8        95.0   93.6       95.4   97.0
?    0.00032   74.3        93.0   94.0       95.0   96.6

Table 2: Pc vs. N3 (4 Class Case Study)


Figure 2: Decision Boundaries for 2 Class Case Study for N3 = 2 hidden layer neurons:
(a) PQNN, (b) ACNN, (c) Backprop, (d) Gaussian.


Figure  3:  4  Class  Case  Study  Data 


We also considered an aircraft database of 128x128 binary synthetic images of three aircraft
(F-4, F-104, and DC-10). For each aircraft, 630 images representing different azimuth and elevation
views were used as training images. The azimuth angles range from -85° to +85° in 5° increments,
while the elevation angles span 0° to 90° in 5° increments. There are 578 test set images per
aircraft type at 2.5° intermediate angles. The features used are invariant to image translation. As
inputs to the classifiers, we use 15 wedge-sampled Fourier transform features. The wedge samples
are normalized so that they provide scale and shift invariance. The performance versus N3
showed that the PQNN is again best.
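
As an illustration of the wedge-sampled Fourier features described above, the following is a minimal digital sketch, not the optical implementation used in this work. It assumes the features are the energies of the Fourier power spectrum in 15 equal angular sectors of a half-plane, normalized by their sum; the function name `wedge_features`, the sector layout, and the exclusion of the DC term are assumptions. Shift invariance follows from using the magnitude spectrum; scale invariance of angular sectors is only approximate on a discrete grid.

```python
import numpy as np

def wedge_features(image, n_wedges=15):
    """Normalized power-spectrum energy in n_wedges angular sectors (assumed layout)."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    rows, cols = image.shape
    v, u = np.indices((rows, cols))
    v = v - rows // 2                      # vertical frequency index, zero at center
    u = u - cols // 2                      # horizontal frequency index, zero at center
    # Fold angles into [0, pi): the spectrum of a real image is point-symmetric.
    theta = np.mod(np.arctan2(v, u), np.pi)
    wedge = np.minimum((theta / np.pi * n_wedges).astype(int), n_wedges - 1)
    keep = (u != 0) | (v != 0)             # drop the DC term
    energy = np.bincount(wedge[keep], weights=spectrum[keep], minlength=n_wedges)
    return energy / energy.sum()           # normalize so the features sum to 1

# A 128x128 binary test image and a translated copy give the same features.
img = np.zeros((128, 128))
img[40:60, 30:90] = 1.0
f0 = wedge_features(img)
f1 = wedge_features(np.roll(img, (7, -12), axis=(0, 1)))
```

The two feature vectors `f0` and `f1` agree to numerical precision, which is the translation invariance the text relies on.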


Figure 4: Decision Boundaries for 4 Class Case Study for N3 = 5 hidden layer neurons:
(a) PQNN, (b) ACNN, (c) Backprop, (d) Gaussian.


7.2 OPTICAL LABORATORY SYSTEM NEURAL NET TESTS

We performed optical laboratory tests on our two-class synthetic and three-class aircraft case
studies and compared results to those obtained by simulations. In all cases, we consider only real
weights. This completes Tasks 7 and 8 concerning the ACNN and PQNN. Table 3 lists our results.
For these 2 case studies, we show the optical laboratory data (Test 1) and that they agree with the
simulation results when all errors were present and trained out (Test 4, assuming 5-bit LCTV input
neuron accuracy). Thus, our simulator and our error source models are validated, as is our
training-out algorithm. If 6-bit input neuron accuracy were available, better Pc would result, as
shown in Test 3 versus Test 4 (57.2% versus 56.8% and 90.5% versus 88.7%). For comparison,
we list the Pc obtained with an ideal system (Test 2) and see that it is not significantly better than
the results we obtained. Thus, our present neural net is nearly the best possible and is quite useful.
Our training-out algorithm clearly provides much better Pc results.


                                           2 Class Aircraft   3 Class Aircraft Case Study
Test                                       Case Study         Test    Train
1      Optical lab                         56.5               88.3    86.1
2      Ideal (no errors)                   62.8               84.7    83.6
3      All errors present, trained out     57.2               90.5    91.1
       (6-bit input neurons)
4      All errors present, trained out     56.8               88.7    86.4
       (5-bit input neurons)

Table 3: Pc results for the PQNN with real weights for several case studies using the optical
laboratory system, an ideal system and systems with different levels of input accuracy using
our training-out algorithm.

7.3 OPTICAL LABORATORY HIGH CAPACITY HK NN TESTS

We used our 1:1 Ho-Kashyap neural net to demonstrate its high storage and its superior
performance. The system parameters are given in Table 4. The error sources considered are given
in Table 5. We trained out all errors except errors 8 and 9. We corrected for errors 5 and 6. No
input noise was present. The σsyn control parameter used (σsyn = 0.00025) corresponded to 10-bit
accuracy.

Simulations predicted a recall accuracy of P'c = 98.17% (average accuracy over 100 random sets
of input data) for a neural net with the parameters listed in Tables 4 and 5. For the exact neural
net run in the optical lab with the specific M = 24 input vectors used, simulations (with only one
run) gave P'c = 100%. We ran the system in the laboratory and achieved P'c = 23/24 = 96%. This
corresponds to only one error in the 24 output vectors, and the single output vector that was
incorrect had only one of its output elements wrong. We attribute this error to the space-variant
beam profile, which is only approximately modeled by a centered Gaussian. This is excellent
storage density (more than any other neural net). Our simulations and optical lab results match
well, and the use of our training-out algorithm is verified.
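
The laboratory figure can be restated at both the vector level and the element level. The arithmetic below is a small worked example using M = 24 stored vector pairs and K = 8 output elements per vector (the values listed in Table 4); the element-level percentage is derived here for illustration and is not a number reported in the text.

```python
M, K = 24, 8                  # stored vector pairs, output elements per vector
wrong_vectors = 1             # one output vector was recalled incorrectly
wrong_elements = 1            # that vector had a single wrong element

vector_accuracy = (M - wrong_vectors) / M                # 23/24
element_accuracy = (M * K - wrong_elements) / (M * K)    # 191/192

print(f"{100 * vector_accuracy:.1f}% of vectors, "
      f"{100 * element_accuracy:.1f}% of elements")
# prints "95.8% of vectors, 99.5% of elements"
```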


PARAMETER   VALUE            DESCRIPTION
N           16               Input vector dimension (unipolar elements)
K           8                Output vector elements (bipolar elements)
M           24 (M/N = 1.5)   Vector pairs stored
L           3                L-max output vector encoding parameter
?           0.0              No input noise was added to key vectors

Table 4: AP system parameters


Error   Location   Error Description                         Parameter Value   Significant Error?
1       P1         Nonlinear input device characteristics                      YES
2       P1         Input accuracy                            5 bits            YES
3       P1         Off (zero) minimum level                  0.0006            NO
4       P2         Mask accuracy                             8 bits            YES
5       P2         Mask nonlinear device characteristics                       YES
6       P2         Gaussian beam taper                       15%               YES
7       P2         On (1.0) maximum level                    0.931             YES
8       P3         Detector precision                        10 bits           YES
9       P3         Detector temporal shot noise variance     10^?              NO

Table 5: Optical system error sources summary


